Preface



The aim of the annual Workshop on Algorithm Engineering and Experiments 
(ALENEX) is to provide a forum for the presentation of original research in the 
implementation and experimental evaluation of algorithms and data structures. 
ALENEX 2001, the third in the series, was held in Washington, DC, on January 
5-6, 2001. This volume collects extended versions of the 15 papers that were 
selected for presentation from a pool of 31 submissions. It also includes the 
abstracts from the three invited speakers, who were supported by the DIMACS
Special Focus on Next Generation Networks.

We would like to take this opportunity to thank the sponsors, authors, and
reviewers who made ALENEX 2001 a success. We would also like to thank
Springer-Verlag for publishing these papers in their series of Lecture Notes in
Computer Science.
Abstract. We consider geometric instances of the Maximum Weighted
Matching Problem (MWMP) and the Maximum Traveling Salesman
Problem (MTSP) with up to 3,000,000 vertices. Making use of a geometric
duality relationship between MWMP, MTSP, and the Fermat-Weber
Problem (FWP), we develop a heuristic approach that yields solutions as
well as upper bounds in near-linear time. Using various computational
tools, we get solutions within considerably less than 1% of the optimum.

An interesting feature of our approach is that, even though an FWP 
is hard to compute in theory and Edmonds’ algorithm for maximum 
weighted matching yields a polynomial solution for the MWMP, the prac- 
tical behavior is just the opposite, and we can solve the FWP with high 
accuracy in order to find a good heuristic solution for the MWMP. 



1 Introduction 

Complexity in Theory and Practice. In the field of discrete algorithms, the clas- 
sical way to distinguish “easy” and “hard” problems is to study their worst-case 
behavior. Ever since Edmonds’ seminal work on maximum matchings [?], the
adjective “good” for an algorithm has become synonymous with a worst-case 
running time that is bounded by a polynomial in the input size. At the same 
time, Edmonds’ method for finding a maximum weight perfect matching in a 
complete graph with edge weights serves as a prime example for a sophisticated 
combinatorial algorithm that solves a problem to optimality. Furthermore, find- 
ing an optimal matching in a graph is used as a stepping stone for many heuristics 
for hard problems. 



A.L. Buchsbaum and J. Snoeyink (Eds.): ALENEX 2001, LNCS 2153, pp. 1-^^ 2001. 
The classical prototype of such a “hard” problem is the Traveling Salesman
Problem (TSP) of computing a shortest roundtrip through a set P of n cities. 
Being NP-hard, it is generally assumed that there is no “good” algorithm in the 
above sense: Unless P=NP, there is no polynomial-time algorithm for the TSP. 
This motivates the performance analysis of polynomial-time heuristics for the 
TSP. Assuming triangle inequality, the best polynomial heuristic known to date 
uses the computation of an optimal weighted matching: Christofides’ method 
combines a Minimum Weight Spanning Tree (MWST) with a Minimum Weight 
Perfect Matching of the odd degree vertices, yielding a worst-case performance 
of 50% above the optimum. 

Geometric Instances. Virtually all very large instances of graph optimization 
problems are geometric. It is easy to see why this should be the case for practical 
instances. In addition, a geometric instance given by n vertices in $\mathbb{R}^d$ is described
by only dn coordinates, while a distance matrix requires $\Omega(n^2)$ entries; even with
today’s computing power, it is hopeless to store and use the distance matrix for
instances with, say, $n = 10^6$.

The study of geometric instances has resulted in a number of powerful theoretical
results. Most notably, Arora [2] and Mitchell [?] have developed a general
framework that results in polynomial time approximation schemes (PTASs) for
many geometric versions of graph optimization problems: Given any constant $\varepsilon > 0$,
there is a polynomial algorithm that yields a solution within a factor of $(1 + \varepsilon)$
of the optimum. However, these breakthrough results are of purely theoretical 
interest, since the necessary computations and data storage requirements are 
beyond any practical orders of magnitude. 

For a problem closely related to the TSP, there is a different way in which
geometry can be exploited. Trying to find a longest tour in a weighted graph is
the so-called Maximum Traveling Salesman Problem (MTSP); it is easy to see
that for graph instances, the MTSP is just as hard as the TSP. Making clever
use of the special geometry of distances, Barvinok, Johnson, Woeginger, and
Woodroofe [?] showed that for geometric instances in $\mathbb{R}^d$, it is possible to solve
the MTSP in polynomial time, provided that distances are measured by a polyhedral
metric, which is described by a unit ball with a fixed number $2f$ of facets.
(For the case of Manhattan distances in the plane, we have $f = 2$, and the
resulting complexity is $O(n^{2f-2} \log n) = O(n^2 \log n)$.) By using a large enough
number of facets to approximate a unit sphere, this yields a PTAS for Euclidean
distances.

Both of these approaches, however, do not provide practical methods for 
getting good solutions for very large geometric instances. And even though TSP 
and matching instances of considerable size have been solved to optimality (up to 
13,000 cities with about 2 years of computing time [?]), it should be stressed that
for large enough instances, it seems quite difficult to come up with small gaps 
within a very short (i.e., near-linear in n) time. Moreover, the methods involved 
only use triangle inequality, and disregard the special properties of geometric 
instances. 
For the Minimum Weight Matching problem, Vaidya [?] showed that there is an
algorithm of complexity $O(n^{2.5} \log^4 n)$ for planar geometric instances, which was
improved by Varadarajan to $O(n^{1.5} \log^5 n)$. Cook and Rohe [?] also made
heavy use of geometry to solve instances with up to 5,000,000 points in the 
plane within about 1.5 days of computing time. However, all these approaches 
use specific properties of planar nearest neighbors. Cook and Rohe reduce the 
number of edges that need to be considered to about 8,000,000, and solve the 
problem in this very sparse graph. These methods cannot be applied when try- 
ing to find a Maximum Weight Matching. (In particular, a divide-and-conquer 
strategy seems unsuited for this type of problem, since the structure of furthest 
neighbors is quite different from the well-behaved “clusters” formed by nearest 
neighbors.) 

Heuristic Solutions. A standard approach when considering “hard” optimization
problems is to solve a closely related problem that is “easier”, and use this solution
to construct one that is feasible for the original problem. In combinatorial
optimization, finding an optimal perfect matching in an edge-weighted graph
is a common choice for the easy problem. However, for practical instances of
matching problems, the number n of vertices may be too large to find an exact
optimum in reasonable time, since the best complexity of an exact algorithm is
$O(n(m + n \log n))$ [?] (where m is the number of edges).¹

We have already introduced the Traveling Salesman Problem, which is known
to be NP-hard, even for geometric instances. A problem that is hard in a different
theoretical sense is the following: For a given set P of n points in $\mathbb{R}^2$, the Fermat-Weber
Problem (FWP) is to minimize the size of a “Steiner star”, i.e., the total
Euclidean distance $S(P) = \min_{c \in \mathbb{R}^2} \sum_{p \in P} d(c,p)$ of a point c to all points in P.

It was shown in [5] that even for the case n = 5, solving this problem requires
finding zeroes of high-order polynomials, which cannot be achieved using only
radicals.

Solving the FWP and solving the geometric maximum weight matching problem
(MWMP) are closely related: It is an easy consequence of the triangle inequality
that MWMP(P) $\le$ FWP(P). For a natural geometric case of Euclidean
distances in the plane, it was shown in [?] that FWP(P)/MWMP(P) $\le 2/\sqrt{3} \approx 1.15$.

From a theoretical point of view, this may appear to assign the roles of 
“easy” and “hard” to MWMP and FWP. However, from a practical perspective, 
roles are reversed: While solving large maximum weight matching problems to 
optimality seems like a hopeless task, finding an optimal Steiner center c only 
requires minimizing a convex function. Thus, the latter can be solved very fast 
numerically (e.g., by Newton’s method) to within any small $\varepsilon$. The twist of this
paper is to use that solution to construct a fast heuristic for maximum weight 
matchings - thereby solving a “hard” problem to approximate an “easy” one. 
Similar ideas can be used for constructing a good heuristic for the MTSP. 

¹ Quite recently, Mehlhorn and Schäfer [?] have presented an implementation of this
algorithm; the largest dense graphs for which they report optimal results have 4,000 
nodes and 1,200,000 edges. 
Summary of Results. It is the main objective of this paper to demonstrate that
the special properties of geometric instances make them much easier in practice
than general instances on weighted graphs. Using these properties gives rise to
heuristics that construct excellent solutions in near-linear time, with very small
constants. Since the analytic worst-case ratio of FWP(P)/MWMP(P) is only
$2/\sqrt{3} \approx 1.15$, it is certain that the difference to the optimum will never exceed
15%, but can be expected to be much less in practice.

1. This is validated by a practical study on instances up to 3,000,000 points, 
which can be dealt with in less than three minutes of computation time, 
resulting in error bounds of not more than about 3% for one type of instances, 
but only in the order of 0.1% for most others. The instances consist of the 
well-known TSPLIB, and random instances of two different random types, 
uniform random distribution and clustered random distribution. 

To evaluate the quality of our results for both MWMP and MTSP, we employ 
a number of additional methods, including the following: 

2. An extensive local search by use of the chained Lin-Kernighan method yields 
only small improvements of our heuristic solutions. This provides experi- 
mental evidence that a large amount of computation time will only lead to 
marginal improvements of our heuristic solutions. 

3. An improved upper bound (that is more time-consuming to compute) indicates
that the remaining gap between the fast feasible solutions and the fast
upper bounds is too pessimistic on the quality of the heuristic, since the 
gap seems to be mostly due to the difference between the optimum and the 
upper bound. 

4. A polyhedral result on the structure of optimal solutions to the MWMP allows 
the computation of the exact optimum by using a network simplex method, 
instead of employing Edmonds’ blossom algorithm. This result (stating that 
there is always an integral optimum of the standard LP relaxation for planar 
geometric instances of the MWMP) is interesting in its own right and was 
observed previously by Tamir and Mitchell [18]. A comparison for instances
with fewer than 10,000 nodes shows that the gap between the solution computed
by our heuristic and the upper bound derived from the FWP(P) is
much larger than the difference between our solution and the actual opti- 
mal value of the MWMP(P), which turns out to be at most 0.26%, even for 
clustered instances. Moreover, twice the optimum solution for the MWMP is 
also an upper bound for the MTSP. For both problems, this provides more 
evidence that additional computing time will almost entirely be used for low- 
ering the fast upper bound on the maximization problem, while the feasible 
solution changes only little. 

In addition, we provide a number of mathematical tools to make the results for 
the MWMP applicable to the MTSP. These results include: 
5. The worst-case estimate for the ratio between MTSP(P) and FWP(P) is
slightly worse than the one between MWMP(P) and FWP(P), since there
are instances where we have
$\mathrm{FWP}(P)/\mathrm{MTSP}(P) = 2/(2+\sqrt{2}) \approx 0.586 > 0.577 \approx 1/\sqrt{3} \ge \mathrm{FWP}(P)/(2\,\mathrm{MWMP}(P))$.
However, we show that for large
n, the asymptotic worst-case performance for the MTSP is the same as for
the MWMP. This means that the worst-case gap for our heuristic is also
bounded by 15%, and not by 17%, as suggested by the above example.

6. For a planar set of points that are sorted in convex position (i.e., the vertices
of a convex polygon in cyclic order), we can solve the MWMP and the MTSP in
linear time.

The results for the MTSP are of similar quality as for the MWMP. Further 
evidence is provided by an additional computational study: 

7. We compare the feasible solutions and bounds for our heuristic with an 
“exact” method that uses the existing TSP package Concorde for TSPLIB 
instances of moderate size (up to about 1000 points). It turns out that most of 
our results lie within the widely accepted margin of error caused by rounding 
Euclidean distances to the nearest integer. Furthermore, the (relatively time- 
consuming) standard Held-Karp bound is outperformed by our methods for 
most instances. 

2 Minimum Stars and Maximum Matchings 

2.1 Background and Algorithm 

Consider a set P of points in $\mathbb{R}^2$ of even cardinality n. The Fermat-Weber Problem
(FWP) is given by minimizing the total Euclidean distance of a “median”
point c to all points in P, i.e., FWP(P) $= \min_{c \in \mathbb{R}^2} \sum_{p \in P} d(c,p)$. This problem
cannot be solved to optimality by methods using only radicals, since it requires
finding zeroes of high-order polynomials, even for instances that are symmetric
to the y-axis; see [5]. However, the objective function is strictly convex, so it is
possible to solve the problem numerically with any required amount of accuracy.
A simple binary search will do, but there are more specific approaches like the
so-called Weiszfeld iteration [?]. We achieved the best results by using Newton’s
method.
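As an illustration of how cheaply the center can be approximated, the following is a minimal sketch of the Weiszfeld iteration mentioned above. It is not the Newton-based code used for the experiments; the function names, the iteration count, and the tolerance `eps` are assumptions chosen for this example.

```python
import math


def star_length(points, c):
    """Total Euclidean distance from c to every point (size of the star)."""
    return sum(math.dist(p, c) for p in points)


def weiszfeld_center(points, iterations=200, eps=1e-12):
    """Approximate the Fermat-Weber point by Weiszfeld's fixed-point iteration.

    Assumption of this sketch: a fixed iteration count instead of a
    convergence test, and data points that coincide with the current
    iterate are skipped to avoid a division by zero.
    """
    n = len(points)
    # Start from the centroid, which is always inside the convex hull.
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    for _ in range(iterations):
        wsum = wx = wy = 0.0
        for x, y in points:
            d = math.hypot(x - cx, y - cy)
            if d < eps:
                continue
            w = 1.0 / d          # weight = inverse distance to the iterate
            wsum += w
            wx += w * x
            wy += w * y
        if wsum == 0.0:          # all points coincide with the iterate
            break
        cx, cy = wx / wsum, wy / wsum
    return cx, cy
```

For the corners of a 4 x 3 rectangle, the iteration stays at the center (2, 1.5), where the star has length 10.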

The relationship between the FWP and the MWMP for a point set of even
cardinality n has been studied in [?]: Any matching edge between two points $p_i$
and $p_j$ can be mapped to two “rays” $(c,p_i)$ and $(c,p_j)$ of the star, so it follows
from triangle inequality that MWMP(P) $\le$ FWP(P). Clearly, the ratio between
the values MWMP(P) and FWP(P) depends on the amount of “shortcutting”
that happens when replacing pairs of rays by matching edges; moreover, any
lower bound for the angle $\varphi_{ij}$ between the rays for a matching edge is mapped
directly to a worst-case estimate for the ratio, since it follows from elementary
trigonometry that $d(c,p_i) + d(c,p_j) \le \sqrt{\frac{2}{1-\cos\varphi_{ij}}} \cdot d(p_i,p_j)$. See Fig. 1. It was
shown in [?] that there is always a matching with $\varphi_{ij} \ge 2\pi/3$ for all angles
$\varphi_{ij}$ between rays. This bound can be used to prove that FWP(P)/MWMP(P)
$\le 2/\sqrt{3} \approx 1.15$.

Fig. 1. Angles and rays for a matching edge $(p_i,p_j)$.
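The trigonometric step behind this estimate can be spelled out as follows. This is a short derivation added for illustration; the shorthand $a$, $b$, and $\varphi$ is introduced here and does not appear in the original text.

```latex
% Shorthand (introduced here): a = d(c,p_i), b = d(c,p_j), \varphi = \varphi_{ij}.
\begin{align*}
d(p_i,p_j)^2 &= a^2 + b^2 - 2ab\cos\varphi
              = (a+b)^2 - 2ab\,(1+\cos\varphi) \\
             &\ge (a+b)^2 - \tfrac{1}{2}(a+b)^2\,(1+\cos\varphi)
              = (a+b)^2 \cdot \frac{1-\cos\varphi}{2},
\end{align*}
% using ab \le (a+b)^2/4 and 1+\cos\varphi \ge 0. Hence
\[
a + b \;\le\; \sqrt{\frac{2}{1-\cos\varphi}}\; d(p_i,p_j),
\qquad\text{and for } 2\pi/3 \le \varphi \le \pi:\quad
\sqrt{\frac{2}{1-\cos\varphi}} \;\le\; \sqrt{\frac{2}{3/2}} \;=\; \frac{2}{\sqrt{3}}.
\]
```

Summing the inequality over all matching edges with angle at least $2\pi/3$ gives the ratio of at most $2/\sqrt{3}$ between the star and the matching.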



Algorithm CROSS: Heuristic solution for MWMP
Input: A set of points $P \subset \mathbb{R}^2$.

Output: A matching of P.

1. Using a numerical method, find a point c that approximately minimizes
the convex function $\sum_{p_i \in P} d(c,p_i)$ over $c \in \mathbb{R}^2$.

2. Sort the set P by angular order around c. Assume the resulting order is
$p_1, \ldots, p_n$.

3. For $i = 1, \ldots, n/2$, match point $p_i$ with point $p_{i+n/2}$.


Fig. 2. The heuristic CROSS. 
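Steps 2 and 3 of the heuristic can be sketched in a few lines. The sketch below assumes that the center c from step 1 is already available and is passed in as `center`; the function names are hypothetical, and sorting by `atan2` is one possible way to obtain the angular order.

```python
import math


def cross_matching(points, center):
    """Steps 2 and 3 of CROSS: sort by angle around center, then match each
    point with the point n/2 positions further along the angular order.

    Returns a list of index pairs into `points`.
    """
    n = len(points)
    if n % 2:
        raise ValueError("CROSS needs an even number of points")
    cx, cy = center
    # Step 2: angular order around the center.
    order = sorted(
        range(n),
        key=lambda i: math.atan2(points[i][1] - cy, points[i][0] - cx),
    )
    # Step 3: match p_i with p_{i+n/2}.
    half = n // 2
    return [(order[i], order[i + half]) for i in range(half)]


def matching_value(points, matching):
    """Total Euclidean length of the matching edges."""
    return sum(math.dist(points[i], points[j]) for i, j in matching)
```

For the four corners of the unit square with center (0.5, 0.5), the sketch matches opposite corners, so the matching consists of the two diagonals.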



If the above lower bound on the angle can be improved, we get a better 
estimate for the value of the matching. This motivates the heuristic CROSS for 
large-scale MWMP instances that is shown in Fig. 2. See Fig. 3 for a heuristic
solution of the TSPLIB instance dsj1000.

Note that beyond a critical accuracy, the numerical method used in step 1
will not affect the value of the matching, since the latter only changes when the
order type of the resulting center point c changes with respect to P. This means
that spending more running time for this step will only lower the upper bound.
We will encounter more examples of this phenomenon below.

Fig. 3. A heuristic MWMP solution for the TSPLIB instance dsj1000 that is within
0.19% of the optimum.

Fig. 4. A class of examples for which CROSS is 15% away from the optimum.
(Figure labels: maximum matching, value $\ge 4k$; heuristic matching, value $(2k+4)\sqrt{3}$.)

The class of examples in Fig. 4 shows that the worst-case relative error estimate
of about 15% is indeed best possible, since the ratio between optimal
and heuristic matching may get arbitrarily close to $2/\sqrt{3}$. As we will see further
down, this worst-case scenario is highly unlikely and the actual error is much
smaller.

Furthermore, it is not hard to see that CROSS is optimal if the points are in 
convex position: 

Theorem 1. If the point set P is in strictly convex position, then algorithm 
CROSS determines the unique optimum. 

For a proof, observe that any pair of matching edges must be crossing, oth- 
erwise we could get an improvement by performing a 2-exchange. 



2.2 Improving the Upper Bound 

When using the value FWP(P) as an upper bound for MWMP(P), we compare 
the matching edges with pairs of rays, with equality being reached if the angle 
enclosed between rays is $\pi$, i.e., for points that are on opposite sides of the
center point c. However, it may well be the case that there is no point opposite
to a point $p_i$. In that case, we have an upper bound on $\max_j \varphi_{ij}$, and we can
lower the upper bound FWP(P). See Fig. 5: the distance $d(c,p_i)$ is replaced by
$d(c,p_i) - \min_{j \ne i} \frac{d(c,p_i) + d(c,p_j) - d(p_i,p_j)}{2}$.

Moreover, we can optimize over the possible location of point c. This lowers
the value of the upper bound FWP(P), yielding the improved upper bound
FWP’(P):

$$\mathrm{FWP}'(P) = \min_{c \in \mathbb{R}^2} \sum_{p_i \in P} \left( d(c,p_i) - \min_{j \ne i} \frac{d(c,p_i) + d(c,p_j) - d(p_i,p_j)}{2} \right)$$



This results in a notable improvement, especially for clustered instances. 
However, the running time for computing this modified upper bound FWP’(P) 
is superquadratic. Therefore, this approach is only useful for mid-sized instances, 
and when there is sufficient time. 
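For a fixed center c, the corrected star length can be evaluated directly in quadratic time. The sketch below does exactly that and does not include the outer optimization over c. It assumes the correction term takes the form "half of the smallest slack d(c,p_i) + d(c,p_j) - d(p_i,p_j) over all j", and the function names are hypothetical. A brute-force matching routine is included only to check the bound on tiny instances.

```python
import math


def improved_star_bound(points, c):
    """Star length around c, with each ray shortened by half of the smallest
    slack d(c,p_i) + d(c,p_j) - d(p_i,p_j) over all j != i.

    For any fixed c this is an upper bound on the maximum weight matching.
    """
    n = len(points)
    dc = [math.dist(p, c) for p in points]
    total = 0.0
    for i in range(n):
        slack = min(
            dc[i] + dc[j] - math.dist(points[i], points[j])
            for j in range(n)
            if j != i
        )
        total += dc[i] - slack / 2.0
    return total


def max_matching_brute_force(points):
    """Exact maximum weight perfect matching by exhaustive search.

    Exponential time; intended for checking very small instances only.
    """
    def best(remaining):
        if not remaining:
            return 0.0
        first, rest = remaining[0], remaining[1:]
        return max(
            math.dist(points[first], points[k])
            + best(tuple(x for x in rest if x != k))
            for k in rest
        )

    return best(tuple(range(len(points))))
```

On small instances one can confirm that the exact optimum is at most the improved bound, which in turn is at most the plain star length around the same center.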
Fig. 5. Improving the upper bound.



2.3 An Integrality Result 

A standard approach in combinatorial optimization is to model a problem as an 
integer program, then solve the linear programming relaxation. As it turns out, 
this works particularly well for the MWMP: 

Theorem 2. Let x be a set of nonnegative edge weights that is optimal for the
standard linear programming relaxation of the MWMP, where all vertices are 
required to be incident to a total edge weight of 1. Then the weight of x is equal 
to an optimal integer solution of the MWMP. 

This theorem has been observed previously by Tamir and Mitchell [18]. The
proof assumes the existence of two fractional odd cycles, then establishes the
existence of an improving 2-exchange by a combination of parity arguments.

Theorem 2 makes it possible to compute the exact optimum by solving a linear
program. For the MWMP, this amounts to solving a network flow problem, which
can be done by using a network simplex method.
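For concreteness, the linear programming relaxation referred to in Theorem 2 can be written out as below. The variable names $x_{ij}$ and $y_{ij}$ are chosen here for illustration. The second program shows one standard way to view the relaxation as an assignment (network flow) problem; it is not necessarily the exact reduction used for the experiments.

```latex
% x_{ij} = x_{ji}: weight of the edge between p_i and p_j (names chosen here).
\begin{align*}
\max\ & \sum_{i<j} d(p_i,p_j)\, x_{ij} \\
\text{s.t.}\ & \sum_{j \ne i} x_{ij} = 1 \quad \text{for all } i = 1,\dots,n, \\
& x_{ij} \ge 0 \quad \text{for all } i < j.
\end{align*}
% Assignment form on ordered pairs, with one variable y_{ij} for each i \ne j:
\begin{align*}
\max\ & \sum_{i \ne j} d(p_i,p_j)\, y_{ij} \\
\text{s.t.}\ & \sum_{j \ne i} y_{ij} = 1, \quad \sum_{j \ne i} y_{ji} = 1
  \quad \text{for all } i, \qquad y_{ij} \ge 0.
\end{align*}
% x_{ij} = (y_{ij} + y_{ji})/2 maps assignment solutions to fractional matchings,
% and y_{ij} = y_{ji} = x_{ij} maps back, so the LP value is half the assignment value.
```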

2.4 Computational Experiments 

Table 1 summarizes some of our results for the MWMP for three classes of
instances, described below. It shows a comparison of the FWP upper bound
with different matchings: In the first column the CROSS heuristic was used
to compute the matching. In the second column we report the corresponding
computing times on a Pentium II 500 MHz (using C code with compiler gcc -O3
under Linux 2.2). The third column gives the result of combining the CROSS
matching with one hour of local search by chained Lin-Kernighan [17]. The last
column compares the optimum computed by a network simplex using Theorem
2 with the upper bound (for n < 10,000). For the random instances, the average
performance over ten different instances is shown.

The first type of instances are taken from the well-known TSPLIB bench- 
mark library. (For odd cardinality TSPLIB instances, we follow the custom of 
dropping the last point from the list.) Clearly, the relative error decreases with 
increasing n. 

Table 1. Maximum matching results for TSPLIB (top), uniform random (center), and
clustered random instances (bottom)

Instance    CROSS vs. FWP    time        CROSS + 1h Lin-Ker    CROSS vs. OPT
dsj1000     1.22%            0.05 s      1.07%                 0.19%
nrw1378     0.05%            0.05 s      0.04%                 0.01%
fnl4460     0.34%            0.13 s      0.29%                 0.05%
usa13508    0.21%            0.64 s      0.19%                 -
brd14050    0.67%            0.59 s      0.61%                 -
d18512      0.14%            0.79 s      0.13%                 -
pla85900    0.03%            3.87 s      0.03%                 -

1000        0.03%            0.05 s      0.02%                 0.02%
3000        0.01%            0.14 s      0.01%                 0.00%
10000       0.00%            0.46 s      0.00%                 -
30000       0.00%            1.45 s      0.00%                 -
100000      0.00%            5.01 s      0.00%                 -
300000      0.00%            15.60 s     0.00%                 -
1000000     0.00%            53.90 s     0.00%                 -
3000000     0.00%            159.00 s    0.00%                 -

1000c       2.90%            0.05 s      2.82%                 0.11%
3000c       1.68%            0.15 s      1.59%                 0.26%
10000c      3.27%            0.49 s      3.24%                 -
30000c      1.63%            1.69 s      1.61%                 -
100000c     2.53%            5.51 s      2.52%                 -
300000c     1.05%            17.51 s     1.05%                 -

The second type was constructed by choosing n points in a unit square uniformly
at random. The reader may observe the near-linear running time. It
should also be noted that for this distribution, the relative error rapidly converges
to zero. This is to be expected: for uniform distribution, the expected
angle $\angle(p_i, c, p_{i+n/2})$ becomes arbitrarily close to $\pi$. In more explicit terms: Both
the value FWP/n and MWMP/n for a set of n random points in a unit square
tend to the limit $\int_{-1/2}^{1/2} \int_{-1/2}^{1/2} \sqrt{x^2 + y^2} \, dx \, dy \approx 0.3826$.
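The limit for uniform instances is the mean distance from the center of the unit square to a uniformly random point, and its numerical value is easy to check. The sketch below evaluates the double integral with a midpoint rule and compares it with the known closed form $(\sqrt{2} + \ln(1+\sqrt{2}))/6$. The function names and the grid resolution are assumptions made for this example.

```python
import math


def mean_distance_to_center(grid=400):
    """Midpoint-rule approximation of the integral of sqrt(x^2 + y^2)
    over the square [-1/2, 1/2] x [-1/2, 1/2]."""
    h = 1.0 / grid
    total = 0.0
    for i in range(grid):
        x = -0.5 + (i + 0.5) * h
        for j in range(grid):
            y = -0.5 + (j + 0.5) * h
            total += math.hypot(x, y)
    return total * h * h


def closed_form_mean_distance():
    """Closed form (sqrt(2) + ln(1 + sqrt(2))) / 6 of the same integral."""
    return (math.sqrt(2.0) + math.asinh(1.0)) / 6.0
```

Both routines agree with the value 0.3826 quoted in the text to the printed precision.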

The third type uses n points that are chosen by selecting random points from
a relatively small expected number k of “cluster” areas. Within each cluster,
points are located with uniform polar coordinates (with some adjustment for
clusters near the boundary) within a circle of radius 0.05 around a central point,
which is chosen uniformly at random from the unit square. This type of instance
is designed to make our heuristic look bad; for this reason, we have shown the
results for k = 5. See Fig. 6 for a typical example with n = 10,000.

Fig. 6. A typical cluster example with its matching.

It is not hard to see that these cluster instances behave very similarly to fractional
instances with k points; moreover, for increasing k, we approach a uniform
random distribution over the whole unit square, meaning that the performance 
is expected to get better. But even for small k, the reader may take note that 
for cluster instances, the remaining error estimate is almost entirely due to lim- 
ited performance of the upper bound. The good quality of our fast heuristic for 
large problems is also illustrated by the fact that one hour of local search by 
Lin-Kernighan fails to provide any significant improvement. 

3 The Maximum TSP 

As we noted in the introduction, the geometric MTSP displays some peculiar 
properties when distances are measured according to some polyhedral norm. In 
fact, it was shown by Fekete [?] that for the case of Manhattan distances in the
plane, the MTSP can be solved in linear time. (The algorithm is based in part
on the observation that for planar Manhattan distances, FWP(P)=MWMP(P).)
On the other hand, it was shown in the same paper that for Euclidean distances
in $\mathbb{R}^3$ or on the surface of a sphere, the MTSP is NP-hard. The MTSP has also
been conjectured to be NP-hard for the case of Euclidean distances in $\mathbb{R}^2$.

3.1 A Worst-Case Estimate

Clearly, there are some observations for the MWMP that can be applied to
the MTSP. In particular, we note that MTSP(P) $\le$ 2MWMP(P) $\le$ 2FWP(P).
On the other hand, the lower-bound estimate of $\frac{\sqrt{3}}{2}$FWP(P) that holds for
MWMP(P) does not imply a lower bound of $\sqrt{3}$FWP(P) for the MTSP(P), as
can be seen from the example in Fig. 7, showing that a relative error of 17% is
possible.
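The chain of upper bounds can be justified in one line for even n. The following derivation is added for illustration; the names $T$, $M_{\mathrm{odd}}$, $M_{\mathrm{even}}$, and $\ell$ are introduced here.

```latex
% T: an optimal tour with edges e_1, e_2, ..., e_n in cyclic order (n even).
% M_odd = {e_1, e_3, ...} and M_even = {e_2, e_4, ...} are both perfect matchings.
% \ell(\cdot) denotes total Euclidean length.
\[
\mathrm{MTSP}(P) \;=\; \ell(T)
\;=\; \ell(M_{\mathrm{odd}}) + \ell(M_{\mathrm{even}})
\;\le\; 2\,\mathrm{MWMP}(P)
\;\le\; 2\,\mathrm{FWP}(P).
\]
```

The last step is the triangle-inequality bound between matching and star from Section 2.1.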

However, we can argue that asymptotically, the worst-case ratio FWP(P)/ 
MTSP(P) is analogous to the $2/\sqrt{3}$ for the MWMP, i.e., within 15% of 2:

Theorem 3. For $n \to \infty$, the worst-case ratio of FWP(P)/MTSP(P) tends to
$1/\sqrt{3}$.




Fig. 7. An example for which the ratio between FWP and MTSP is greater than 
0.577. 
Proof: The proof of the $2/\sqrt{3}$ bound for the MWMP in [?] establishes that
any planar point set can be subdivided by six sectors of $\pi/3$ around one center
point, such that opposite sectors have the same number of points. This allows a
matching between opposite sectors, establishing a lower bound of $2\pi/3$ for the
angle between the corresponding rays. This means that we can simply choose
three subtours, one for each pair of opposite sectors, and achieve the same worst-case
ratio as for a matching. In order to merge these subtours, we only need three
edges between adjacent sectors. If there are more than n/2 points “far” from the
center, i.e., at least $\Omega(\mathrm{FWP}(P)/n)$ away from the center, then the resulting
error tends to 0 as n grows, and we get the same worst-case estimate as for the
MWMP.

This leaves the case that at least n/2 points are “close” to the center, i.e.,
only $o(\mathrm{FWP}(P)/n)$ from the center. Then we can collect all points far from
the center individually from the cluster close to the center. Now it is not hard
to see that for this case, the length of the resulting tour converges to 2FWP(P).

□ 



3.2 A Modified Heuristic 

For an even number of points in convex position, the choice of a maximum 
matching is rather straightforward. This leads to the CROSS heuristic described 
above. Similarly, it is easy to determine a maximum tour if we are dealing with 
an odd number of points in convex position: Each point $p_i$ gets connected to its
two “cyclic furthest neighbors” $p_{i+\lfloor n/2 \rfloor}$ and $p_{i+\lceil n/2 \rceil}$. However, the structure of an
optimal tour is less clear for a point set of even cardinality, and therefore it is not
obvious what permutations should be considered for an analogue to the matching
heuristic CROSS. For this we consider the local modification called 2-exchanges:
One pair of (disjoint) tour edges $(p_i,p_j)$ and $(p_k,p_l)$ gets replaced by the pair
$(p_i,p_k)$ and $(p_j,p_l)$, and the sequence $p_j, \ldots, p_k$ is reversed into $p_k, \ldots, p_j$.

Theorem 4. If the point set P is in convex position, then there are at most n/2 
tours that are locally optimal with respect to 2-exchanges, and we can determine 
the best in linear time. 

Proof: We claim that any tour that is locally optimal with respect to 2-exchanges
must look like the one in Fig. 8: It consists of two diagonals $(p_i,p_{i+n/2})$ and
$(p_{i+1},p_{i+1+n/2})$ (in the example, these are the edges (5,11) and (6,0)), while all
other edges are near-diagonals, i.e., edges of the form $(p_j,p_{j+n/2-1})$.

First consider 2-exchanges that increase the tour length: It is an easy consequence
of triangle inequality that a noncrossing disjoint “antiparallel” pair of
edges such as $e_0$ and $e_1$ in Fig. 9(a) allows a crossing 2-exchange that increases the
overall tour length. In the following, we will focus on identifying antiparallel
noncrossing edge pairs.

Now we show that all edges in a locally optimal tour must be diagonals or
near-diagonals: Consider an edge $e_0 = (p_i,p_j)$ with $0 < j - i \le n/2 - 2$. Then
there are at most $n/2 - 3$ points in the subset $P_1 = [p_{i+1}, \ldots, p_{j-1}]$, but at least
$n/2 + 1$ points in the subset $P_2 = [p_{j+1}, \ldots, p_{i-1}]$. This implies that there must
be at least two edges (say, $e_1$ and $e_2$) within the subset $P_2$. If either of them is
antiparallel to $e_0$, we are done, so assume that both of them are parallel. Without
loss of generality assume that the head of $e_2$ lies “between” the head of $e_1$ and
the head $p_j$ of $e_0$, as shown in Fig. 9(b). Then the edge $e_3$ that is the successor
of $e_2$ in the current tour is either antiparallel and noncrossing with $e_1$, or with
$e_0$.

Fig. 8. A locally optimal MTSP tour.

Fig. 9. Discussing locally optimal tours.

Next consider a tour only consisting of diagonals and near-diagonals. Since
there is only one 2-factor consisting of nothing but near-diagonals, assume without
loss of generality that there is at least one diagonal, say $e_0 = (p_0,p_{n/2})$. Then the
successor of $p_{n/2}$ and the predecessor of $p_0$ cannot lie on the same side of $e_0$, as
shown in Fig. 9(c): Otherwise, with $e_1$ and $e_2$ denoting the two tour edges leading
to them, there must be an edge $e_3$ within the set of points on
the other side of $e_0$. This edge is noncrossing with both $e_1$ and $e_2$; either it is
antiparallel to $e_1$ or to $e_2$, and we are done.

This implies that the existence of a diagonal in the tour and one of two
possible choices of near-diagonals as the edge succeeding the diagonal in the
tour determines the rest of the tour. Now it is straightforward to check that the
resulting tour must look as in Fig. 8, concluding the proof. □

This motivates a heuristic analogous to the one for the MWMP. For simplicity,
we call it CROSS’. See Fig. 10. From Theorem 4 it is easy to see that the
following holds:

Corollary 1. If the point set P is in convex position, then algorithm CROSS’
determines the optimum.
Algorithm CROSS’: Heuristic solution for MTSP

Input: A set of points $P \subset \mathbb{R}^2$.

Output: A tour of P.

1. Using a numerical method, find a point c that approximately minimizes
the convex function $\sum_{p_i \in P} d(c,p_i)$ over $c \in \mathbb{R}^2$.

2. Sort the set P by angular order around c. Assume the resulting order is
$p_1, \ldots, p_n$.

3. For $i = 1, \ldots, n$, connect point $p_i$ with point $p_{i+n/2-1}$.
Compute the resulting total length L.

4. Compute $D = \max_{i=1}^{n} [d(p_i,p_{i+n/2}) + d(p_{i+1},p_{i+1+n/2})
- d(p_i,p_{i+n/2+1}) - d(p_{i+1},p_{i+n/2})]$ (all indices are taken modulo n).

5. Choose the tour of length L + D that arises by picking the two diagonals
where the maximum in 4. is attained.



Fig. 10. The heuristic CROSS’ 
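Steps 3 to 5 can be sketched as follows for an even number of points that are already given in angular order (steps 1 and 2 are assumed to be done). The function name is hypothetical. The sketch also assumes that the two near-diagonals dropped in step 4 are $(p_i, p_{i+n/2+1})$ and $(p_{i+1}, p_{i+n/2})$, which is the choice that leaves every vertex with degree 2.

```python
import math


def cross_prime_tour(points):
    """Steps 3-5 of CROSS' for an even number n >= 4 of points that are
    already listed in angular (cyclic) order.

    Returns (tour, length): a list of indices visiting every point once,
    and the tour length L + D.
    """
    n = len(points)
    if n < 4 or n % 2:
        raise ValueError("this sketch handles even n >= 4 only")
    h = n // 2

    def d(a, b):
        return math.dist(points[a % n], points[b % n])

    # Step 3: the 2-factor of all near-diagonals (p_i, p_{i+n/2-1}).
    length_l = sum(d(i, i + h - 1) for i in range(n))

    # Step 4: gain of swapping two near-diagonals for two adjacent diagonals.
    def gain(i):
        return d(i, i + h) + d(i + 1, i + 1 + h) - d(i, i + h + 1) - d(i + 1, i + h)

    best = max(range(n), key=gain)
    gain_d = gain(best)

    # Step 5: build the edge set of the chosen tour and walk along it.
    edges = {frozenset((i, (i + h - 1) % n)) for i in range(n)}
    edges.discard(frozenset(((best + 1) % n, (best + h) % n)))
    edges.discard(frozenset((best % n, (best + h + 1) % n)))
    edges.add(frozenset((best % n, (best + h) % n)))
    edges.add(frozenset(((best + 1) % n, (best + 1 + h) % n)))

    adj = {v: [] for v in range(n)}
    for e in edges:
        a, b = tuple(e)
        adj[a].append(b)
        adj[b].append(a)

    tour, prev, cur = [0], None, 0
    while True:
        nxt = adj[cur][0] if adj[cur][0] != prev else adj[cur][1]
        if nxt == 0:
            break
        tour.append(nxt)
        prev, cur = cur, nxt
    return tour, length_l + gain_d
```

For a regular hexagon with unit circumradius, the resulting tour uses two diagonals of length 2 and four near-diagonals of length $\sqrt{3}$.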



3.3 No Integrality 

As the example in Fig. 11 shows, there may be fractional optima for the subtour
relaxation of the MTSP. The fractional solution consists of all diagonals (with
weight 1) and all near-diagonals (with weight 1/2). It is easy to check that this
solution is indeed a vertex of the subtour polytope, and that it beats any integral
solution. (See [?] on this matter.) This implies that there is no simple analogue
to Theorem 2 for the MWMP, and we do not have a polynomial method that
can be used for checking the optimal solution for small instances.




Fig. 11. A fractional optimum for the subtour relaxation of the MTSP. 



3.4 Computational Experiments 

The results are of similar quality as for the MWMP. See Table 2. Here we only
give the results for the seven most interesting TSPLIB instances. Since we do
not have an easy comparison with the optimum for instances of medium size,
we give a comparison with the upper bound 2MAT, denoting twice the optimal
solution for the MWMP. As before, this was computed by a network simplex
method, exploiting the integrality result for planar MWMP. The results show
that here, too, most of the remaining gap lies on the side of the upper bound.

Table 2. Maximum TSP results for TSPLIB (top), uniform random (center), and
clustered random instances (bottom)

Instance    CROSS’ vs. FWP   time        CROSS’ + 1h Lin-Ker   CROSS’ vs. 2MAT
dsj1000     1.36%            0.05 s      1.10%                 0.329%
nrw1379     0.23%            0.01 s      0.20%                 0.194%
fnl4461     0.34%            0.12 s      0.31%                 0.053%
usa13509    0.21%            0.63 s      0.19%                 -
brd14051    0.67%            0.46 s      0.64%                 -
d18512      0.15%            0.79 s      0.14%                 -
pla85900    0.03%            3.87 s      0.03%                 -

1000        0.04%            0.06 s      0.02%                 0.02%
3000        0.02%            0.16 s      0.01%                 0.00%
10000       0.01%            0.48 s      0.00%                 -
30000       0.00%            1.47 s      0.00%                 -
100000      0.00%            5.05 s      0.00%                 -
300000      0.00%            15.60 s     0.00%                 -
1000000     0.00%            54.00 s     0.00%                 -
3000000     0.00%            160.00 s    0.00%                 -

1000c       2.99%            0.05 s      2.87%                 0.11%
3000c       1.71%            0.15 s      1.61%                 0.26%
10000c      3.28%            0.49 s      3.25%                 -
30000c      1.63%            1.69 s      1.61%                 -
100000c     2.53%            5.51 s      2.52%                 -
300000c     1.05%            17.80 s     1.05%                 -

Table 3 shows an additional comparison for TSPLIB instances of moderate
size. Shown are (1) the tour length found by our fastest heuristic; (2) the relative
gap between this tour length and the fast upper bound; (3) the tour length
found with additional Lin-Kernighan; (4) “optimal” values computed by using
the Concorde code* for solving Minimum TSPs to optimality; (5) and (6) the
two versions of our upper bound; (7) the maximum version of the well-known
Held-Karp bound.

* That code was developed by Applegate, Bixby, Chvátal, and Cook and is available
at http://www.caam.rice.edu/~keck/Concorde.html.

In order to apply Concorde, we have to transform the MTSP into a Minimum
TSP instance with integer edge lengths. As the distances for geometric
instances are not integers, it has become customary to transform distances into
integers by rounding to the nearest integer. When dealing with truly geometric
instances, this rounding introduces a certain amount of inaccuracy in the
resulting optimal value. Therefore, Table 3 shows two results for the value OPT:
The smaller one is the true value of the “optimal” tour that was computed by
Concorde for the rounded distances; the second one is the value obtained by
re-transforming the rounded objective value. As can be seen from the table, even
the tours constructed by our near-linear heuristic can beat the “optimal” value,
and the improved heuristic value almost always does. This shows that our
heuristic approach yields results within a widely accepted margin of error; furthermore,
it illustrates that thoughtless application of a time-consuming “exact” method
may yield a worse performance than using a good and fast heuristic. Of course
it is possible to overcome this problem by using sufficiently increased accuracy;
however, it is one of the long outstanding open problems on the Euclidean TSP
whether computations with a polynomially bounded number of digits in terms
of n suffice for this purpose. This amounts to deciding whether the Euclidean
TSP is in NP. See [Il|. 
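
The effect of rounding described above can be illustrated with a small sketch. The points, the function name `tour_length` and the nearest-integer rule `int(d + 0.5)` are assumptions chosen for illustration; they are not taken from TSPLIB or from Concorde.

```python
import math

def tour_length(points, order, rounded=False):
    """Length of the closed tour that visits `points` in the given `order`.

    With rounded=True every edge length is rounded to the nearest integer
    before summation (an assumed convention for this sketch).
    """
    total = 0.0
    n = len(order)
    for k in range(n):
        x1, y1 = points[order[k]]
        x2, y2 = points[order[(k + 1) % n]]
        d = math.hypot(x2 - x1, y2 - y1)
        if rounded:
            d = int(d + 0.5)  # nearest integer for nonnegative d
        total += d
    return total

# Hypothetical instance: a square with side length 1.4.
points = [(0.0, 0.0), (1.4, 0.0), (1.4, 1.4), (0.0, 1.4)]
order = [0, 1, 2, 3]

true_len = tour_length(points, order)                   # about 5.6
rounded_len = tour_length(points, order, rounded=True)  # every edge counts as 1
```

On this toy instance the rounded objective differs from the true length by more than a quarter, so a tour that is optimal for rounded distances need not be optimal for the true ones.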

The Held-Karp bound (which is usually quite good for Min TSP instances) 
can also be computed as part of the Concorde package. However, it is relatively 
time-consuming when used in its standard form: It took 20 minutes for instances 
with n ≈ 100, and considerably more for larger instances. Clearly, this bound
should not be the first choice for geometric MTSP instances. 

Table 3. Maximum TSP results for small TSPLIB instances: Comparing CROSS’ and
FWP with other bounds and solutions

Instance   CROSS'    CROSS'    CROSS' +   OPT                  FWP'      FWP       Held-Karp
                     vs. FWP   Lin-Ker    via Concorde                             bound

eil101        4966   0.15%        4966    [4958, 4980]            4971      4973      4998
bier127     840441   0.16%      840810    [840811, 840815]      841397    841768    846486
ch150        78545   0.12%       78552    [78542, 78571]         78614     78638     78610
gil262       39169   0.05%       39170    [39152, 39229]         39184     39188     39379
a280         50635   0.13%       50638    [50620, 50702]         50694     50699     51112
lin318      860248   0.09%      860464    [860452, 860512]      860935    861050    867060
rd400       311642   0.05%      311648    [311624, 311732]      311767    311767    314570
fl417       779194   0.18%      779236    [779210, 779331]      780230    780624    800402
rat783      264482   0.00%      264482    [264431, 264700]      264492    264495    274674
d1291      2498230   0.06%     2498464    [2498446, 2498881]   2499627   2499657   2615248
Abstract. We present a generic package for resource constrained network
optimization problems. We illustrate the flexibility and the use of
our package by solving four applications: route planning, curve approximation,
minimum cost reliability constrained spanning trees and the
table layout problem.



1 Introduction 

There are a large number of graph and network problems that can be efficiently
solved in polynomial time, like the shortest path problem, the minimum spanning
tree problem or flow problems [AMO93]. However, adding a single side constraint
involving another cost function to these problems usually makes the problem
NP-hard (see Garey and Johnson [GJ79]). Since many practical applications can be
modeled using resource constraints, it is of great interest to nevertheless solve
the problem efficiently.

In [MZ00] we studied the constrained shortest path problem, which is to find a
path of minimal cost satisfying additional resource constraint(s). We extended
previous methods of Handler and Zang [HZ80] and Beasley and Christofides
[BC89] that solve the problem by first solving a Lagrangean relaxation and then
closing the gap by path ranking. In our experiments we found that the method
is efficient and clearly superior to ILP solving, naive path ranking and dynamic
programming. It was also highly competitive with labeling methods.

The same approach as in constrained shortest paths also applies to other net- 
work optimization problems with resource constraints: first solving a Lagrangean 
relaxation and then ranking solutions. 

Since the problem is of great practical interest in operations research we 
decided to develop a general package that provides efficient algorithms for specific 
problems and is easily adapted to other problems. 

* Partially supported by the IST Programme of the EU under contract number
IST-1999-14186 (ALCOM-FT).

** Partially supported by a Graduate Fellowship of the German Research Foundation 
(DFG). 
This paper is organized as follows. In the next section we present the theory
underlying constrained network optimization. The design of our package is
discussed in Section 3. In Section 4 we describe four applications that illustrate the
power and flexibility of the package.

2 Underlying Theory 

Suppose we are given a network G with n nodes and m edges and a cost function
$c : E \to \mathbb{R}$ defined on the edges.

Many linear network optimization problems like shortest paths, minimum 
spanning trees or minimum cuts ask for the computation of a list of edges sat- 
isfying specific constraints (to form a path, a spanning tree etc.) such that the 
sum of the edge costs is minimized. 

Using 0-1-variables $z_e$ for each edge in the graph this can be written as an
integer linear program (ILP) as follows:

$$\min \sum_{e \in E} c_e z_e \quad \text{s.t.} \quad z \in S$$

where $z \in S$ abbreviates the specific constraints (path, spanning tree, etc).

For many network optimization problems of this form there exist efficient
algorithms to solve the problem in polynomial time $O(p(n, m, c))$ (like in the
case of shortest paths, minimum spanning trees etc.).

If we also have a resource function $r : E \to \mathbb{R}$ defined on the edges and
impose the additional constraint that the resource consumption of our optimal
solution should not exceed a given limit $\lambda$ we get a (resource) constrained network
optimization problem of the form:

$$\min \sum_{e \in E} c_e z_e \quad \text{s.t.} \quad z \in S, \quad \sum_{e \in E} r_e z_e \le \lambda$$

The additional resource constraint usually makes the problem NP-hard, as is
for example the case for constrained shortest paths or constrained minimum
spanning trees (see [GJ79]).

For the constrained shortest path problem, Handler and Zang [HZ80], Beasley
and Christofides [BC89] and Mehlhorn and Ziegelmann [MZ00] proposed
primal-dual methods to solve the Lagrangean relaxation of the problem. The key idea
there is to relax the resource constraint in Lagrangean fashion, moving it into
the objective function such that for a fixed Lagrange multiplier we only have to
solve a corresponding unconstrained problem for modified costs.

We now briefly review the method of Mehlhorn and Ziegelmann [MZ00] in
the general setting:
We introduce integer variables $x_{sol}$ for each possible solution (e.g. path,
spanning tree, etc.) of our problem. So our problem can be stated as:

$$\min \sum_{sol} c_{sol} x_{sol} \quad \text{s.t.} \quad \sum_{sol} x_{sol} = 1, \quad \sum_{sol} r_{sol} x_{sol} \le \lambda$$

If we drop the integrality constraint and set up the dual problem we get:

$$\max\; u + \lambda v \quad \text{s.t.} \quad u + v\, r_{sol} \le c_{sol} \;\; \forall\, sol, \quad v \le 0$$

Now we have only two variables albeit a possibly exponential number of
constraints. The (implicitly given) constraints can be interpreted as halfspace
equations, so the problem is to find the maximal point in direction $(1, \lambda)$ in the
halfspace intersection (see left part of Figure 1).
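
As a small illustration of the dual problem, the following sketch evaluates the dual objective for an explicitly listed set of solutions. For a fixed v ≤ 0 the best feasible u is the minimum of c_sol − v·r_sol over all solutions. The three (resource, cost) pairs, the limit 10 and the name `dual_value` are assumptions made up for this example; in practice the solutions are of course not enumerated.

```python
def dual_value(points, limit, v):
    """Dual objective u + limit * v for a fixed v <= 0.

    `points` is a list of (resource, cost) pairs, one per solution.
    The largest u satisfying u + v * r <= c for all solutions is
    the minimum of c - v * r.
    """
    u = min(c - v * r for (r, c) in points)
    return u + limit * v

# Hypothetical solutions as points in the r-c-plane.
points = [(20, 2), (4, 18), (8, 15)]
limit = 10

values = {v: dual_value(points, limit, v) for v in (0, -1, -2)}
```

Among the three tested values, v = −1 gives the largest dual value, 12. The cheapest solution respecting the limit is (8, 15), so on this toy data there is a gap of 3 between the relaxation and the constrained optimum.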





Fig. 1. (Left) Find the maximal point in direction $(1, \lambda)$ in the halfspace intersection.
(Right) Find line with maximal c-value at $r = \lambda$ which has all points above or on it.



Since we have a bicriteria problem, each solution has a cost and a resource
value and hence can be interpreted as a point in the r-c-plane. Thus the optimal
solution of the relaxation is the line with maximal c-value at the limit line $r = \lambda$
having all points above or on it (see right part of Figure 1).

The hull approach [MZ00] now finds this optimal line by computing extreme
points, i.e. points on the lower hull. Starting with the segment defined by the
minimum cost and the minimum resource point we want to compute the extreme
point in its normal direction. The invariant is that we maintain the tentatively
optimal hull segment of the extreme points seen so far. This is done in time
$O(p(n, m, c))$ per iteration by solving unconstrained problems with scaled costs

$$\tilde{c}_e = c_e - v\, r_e,$$

where $v$ is the slope of the current segment.
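
The iteration just described can be sketched for the single resource shortest path case as follows. This is a minimal illustration under stated assumptions, not the CNOP implementation: a plain Dijkstra stands in for the core algorithm, the graph is a hypothetical adjacency dictionary mapping a node to a list of (neighbor, cost, resource) triples, and the names `shortest_path` and `hull_approach` are invented here. The multiplier `mu` equals −v, the negated slope of the current segment.

```python
import heapq

def shortest_path(adj, s, t, weight):
    """Dijkstra for nonnegative weights. Returns (cost, resource, nodes) or None."""
    dist, pred, done = {s: 0.0}, {}, set()
    heap = [(0.0, s)]
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        if u == t:
            break
        for v, c, r in adj.get(u, []):
            nd = d + weight(c, r)
            if v not in dist or nd < dist[v]:
                dist[v] = nd
                pred[v] = (u, c, r)
                heapq.heappush(heap, (nd, v))
    if t not in dist:
        return None
    path, cost, res, v = [t], 0, 0, t
    while v != s:
        u, c, r = pred[v]
        cost, res, v = cost + c, res + r, u
        path.append(u)
    path.reverse()
    return cost, res, path

def hull_approach(adj, s, t, limit, eps=1e-9):
    """Return (lower bound, upper bound, feasible path data) or None if infeasible."""
    low = shortest_path(adj, s, t, lambda c, r: c)    # minimum cost point
    if low is None:
        return None
    if low[1] <= limit:
        return low[0], low[0], low                    # already feasible
    feas = shortest_path(adj, s, t, lambda c, r: r)   # minimum resource point
    if feas[1] > limit:
        return None
    while True:
        mu = (feas[0] - low[0]) / (low[1] - feas[1])  # negated segment slope
        new = shortest_path(adj, s, t, lambda c, r: c + mu * r)
        if new[0] + mu * new[1] >= low[0] + mu * low[1] - eps:
            # No point below the segment: its value at r = limit is the bound.
            return low[0] + mu * (low[1] - limit), feas[0], feas
        if new[1] <= limit:
            feas = new
        else:
            low = new

# Hypothetical instance with three s-t paths: (r, c) = (20, 2), (4, 18), (8, 15).
adj = {0: [(1, 1, 10), (2, 9, 2), (3, 15, 8)], 1: [(3, 1, 10)], 2: [(3, 9, 2)]}
lb, ub, sol = hull_approach(adj, 0, 3, 10)
```

On this toy instance the first segment is already optimal, giving a lower bound of 12 and the feasible path 0, 2, 3 of cost 18 as upper bound. The path 0, 3 of cost 15 lies above the segment and is therefore not found by the relaxation.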
Mehlhorn and Ziegelmann [MZ00] showed that for integral costs and
resources (C and R denoting the maximum cost and resource of an edge,
respectively) the number of iterations in the hull approach is $O(\log(nRC))$. Their
proof also extends to the general constrained network optimization case, so we
can conclude that the hull approach computes the optimum of the relaxation in
$O(\log(nRC) \cdot p(n, m, c))$ time, which is polynomial in the input.

Solving the relaxation with the hull approach we obtain a feasible solution
giving an upper bound for the problem whereas the optimal solution of the
relaxation provides a lower bound. A duality gap may exist since the upper
bound, the line found by the hull approach and the limit line define an area
where the true optimal solution may lie (see Figure 2).

We may close this gap by a solution ranking procedure: we enumerate
solutions by increasing scaled cost. Lawler gave a general method for ranking
discrete optimization problems [Law72] (there exist special algorithms for path
ranking [JM99, Epp99] and spanning tree ranking [KIM81, Epp90]).

In the gap closing step we rank the solutions with respect to the scaled costs
corresponding to the optimum of the relaxation. This can be viewed as sweeping
the area in which the optimal solution may lie with the line found by the hull
approach. We update the bounds during the ranking process and may stop when
we have swept the promising area (see Figure 2).
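
The sweeping argument can be made concrete with a small sketch for paths. It assumes the multiplier `mu` (the negated slope of the optimal line) is already known and that all scaled edge costs are nonnegative. A naive best-first enumeration of simple paths stands in for a proper k-shortest path algorithm; the graph format and the names `paths_by_scaled_cost` and `close_gap` are hypothetical.

```python
import heapq

def paths_by_scaled_cost(adj, s, t, mu):
    """Yield (scaled, cost, resource, path) for simple s-t paths in
    nondecreasing order of scaled cost c + mu * r (exponential in general)."""
    heap = [(0.0, 0, 0, [s])]
    while heap:
        scaled, cost, res, path = heapq.heappop(heap)
        u = path[-1]
        if u == t:
            yield scaled, cost, res, path
            continue
        for v, c, r in adj.get(u, []):
            if v not in path:
                heapq.heappush(
                    heap, (scaled + c + mu * r, cost + c, res + r, path + [v]))

def close_gap(adj, s, t, limit, mu, ub=float("inf"), best=None):
    """Rank paths by scaled cost and update the bounds until the gap is closed."""
    for scaled, cost, res, path in paths_by_scaled_cost(adj, s, t, mu):
        # Any feasible path ranked from here on has cost >= scaled - mu * limit.
        lb = scaled - mu * limit
        if lb >= ub:
            break
        if res <= limit and cost < ub:
            ub, best = cost, path
    return ub, best

# Hypothetical instance: adj[u] lists (neighbor, cost, resource).
adj = {0: [(1, 1, 10), (2, 9, 2), (3, 15, 8)], 1: [(3, 1, 10)], 2: [(3, 9, 2)]}
opt, opt_path = close_gap(adj, 0, 3, limit=10, mu=1.0)
```

Here the two paths of scaled cost 22 are ranked first (one infeasible, one feasible with cost 18), then the path 0, 3 of scaled cost 23, which is feasible and improves the upper bound to 15. In a larger instance the loop would stop as soon as the lower bound reaches the current upper bound.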

Here is where NP-hardness comes in: we cannot give a polynomial bound on
the number of solutions that have to be ranked. There also is no approximation
guarantee for the bounds resulting from the relaxation [MZ00]. However,
experiments for the constrained shortest path problem



isecifl mmi KfRsoi 



show that the bounds are usually very good and hence lead to an efficient gap
closing step (see also Table 3).




Fig. 2. Closing the gap between upper and lower bound. 



^ A straightforward adaption of the hull approach for a single resource enables the 
computation of the whole lower hull of the underlying problem in 0{{nRC)^^^) time 
which provides an idea of how good the bounds are for different resource limits. 
The hull approach also works for multiple resources (i.e. having k resource
functions and k resource limits)². The gap closing step via solution ranking also
extends.

We obtain a general 2-phase method for constrained network optimization
problems for an arbitrary number of resources: first solve the Lagrangean
relaxation and then close the duality gap via solution ranking.

Apart from the hull approach³ there also exist other methods for solving
the Lagrangean relaxation: In the single resource case, we can also use binary
search on the slopes [Xue00, MZ00]. Beasley and Christofides [BC89] proposed
a subgradient procedure that approximates the optimum of the relaxation (also
for multiple resources).

In the next section we describe a software package that offers all these relax- 
ation approaches and provides a generic framework for the 2-phase method for 
constrained network optimization. 

3 Design of the CNOP Package 

We describe CNOP, a Constrained Network Optimization Package. We implemented
the package in C++ using LEDA [MN99, MNSU].

The main function of the package is

RESULT cnop(G, s, t, cost, resource, upperbound, netopt, ranking);

Here G is the graph of the underlying network problem, s and t are nodes, cost
is the cost function defined on the edges, resource is the resource function(s)
defined on the edges and upperbound is the resource limit(s).

netopt is a function used as core function by the relaxation methods. It is
of general form

list<edge> netopt(G, s, t, cost, resource, scalevector, c, newpoint);

where scalevector specifies how to scale the new weight function for the net- 
work optimization. The result of the scaled optimization is returned as a list of 
edges, its value in c and the cost and resource value of the solution is returned 
in newpoint. 

When the user has specified this function and the desired method to solve 
the relaxation (hull approach, binary search and subgradient procedure in the 
single resource case and hull approach and subgradient procedure in the multiple 
resource case), the optimal value of the relaxation is returned. 

To close the possible duality gap the user now has to specify the function 
ranking which should be of the form 

bool ranking(G, s, t, cost, resource, upperbound, scaledcost,
             scalevector, UB, LB, UBsol);

where scaledcost is the scaled cost function leading to the optimum of the
relaxation, UB and LB are upper and lower bounds and UBsol is the solution resulting
from the relaxation. Now the ranking function should rank the solutions with
respect to weight function scaledcost and update the bounds UB, LB and best
feasible solution UBsol until the optimum is found or infeasibility is detected.

² However, [MZ00] could not show a polynomial bound on the number of iterations of
the hull approach for multiple resources.

³ The method of Handler and Zang [HZ80] for the single resource case is very similar
to the hull approach.

The user also may provide a special function testing the feasibility of a solu- 
tion and thus may also incorporate lower bounds on the resource consumption, 
etc. 

Since the gap closing step may take a long time depending on the scaled costs 
it is possible to specify break criteria, e.g. the number of solutions to be ranked 
or difference between upper and lower bound. 

This provides the generic concept for constrained network optimization as
discussed in Section 2. The cnop function returns a list of edges specifying the
optimal solution or reports infeasibility. If the ranking was aborted before closing
the duality gap we return lower and upper bound with the best feasible solution.

At the moment our package is tested for gcc 2.95.2 under Solaris but we
plan to support other compilers and platforms in the future as well as providing
CNOP as a LEDA extension package. The hull approach for multiple
resources makes use of either the implementation of the d-dimensional convex
hull algorithm in the Geokernel LEP or the CPLEX callable 

library Eza, the hull approach for the single resource case works without ad- 
ditional software. 



3.1 Special Case: Constrained Shortest Paths 



The constrained shortest path problem is to find a minimum cost path between 
two nodes that satisfies the resource limit(s). 

Previous work on constrained shortest paths can be categorized into three
different approaches: dynamic programming methods [Jok66], labeling methods
[AAN83, DDSS95] and 2-phase methods [HZ80, BC89, MZ00] first solving the
Lagrangean relaxation and then closing the duality gap (see Section 2).

CNOP offers a special function for constrained shortest paths.

As core algorithm for the relaxation methods in our generic approach we use
LEDA’s implementation of Dijkstra’s algorithm, but a user may also supply
their own shortest path implementation.

For the gap closing step we offer three different methods: We reimplemented
the recursive enumeration algorithm of Jiménez and Marzal [JM99], which seems
to perform better in practice than the theoretically best known k-shortest path
algorithm of Eppstein [Epp99]. Moreover, we implemented a label setting
algorithm with certain heuristics [MZ00] that is similar to the one proposed
by Aneja et al. [AAN83] and a label correcting approach as in [DDSS95]. We
also offer the dynamic programming implementation of Joksch [Jok66, BC89]. Of
course it is also possible to provide one's own implementation for the gap closing
step.

In addition to the general framework, we also offer problem reduction
methods as proposed by Aneja et al. [AAN83] and Beasley and Christofides [BC89].
Nodes and edges of the graph are removed whose inclusion in a path would force
the resource consumption over the resource limit or the length over the upper
bound on the optimal solution. These reductions may reduce the size of a
problem considerably, enabling a faster gap closing step. They can be switched on or
off by the user.
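
A resource based reduction of this kind can be sketched as follows. The sketch only covers the resource test (the analogous test against the cost upper bound is omitted), uses a hypothetical edge list of (u, v, cost, resource) tuples, and the names `min_resource` and `reduce_by_resource` are invented for illustration. A node or edge is kept only if the minimum resource path through it respects the limit.

```python
import heapq

def min_resource(adj, source):
    """Dijkstra on resource values; adj[u] lists (neighbor, resource)."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # outdated heap entry
        for v, r in adj.get(u, []):
            if v not in dist or d + r < dist[v]:
                dist[v] = d + r
                heapq.heappush(heap, (d + r, v))
    return dist

def reduce_by_resource(edges, s, t, limit):
    """Return the nodes and edges that can lie on a path within the limit."""
    fwd, bwd = {}, {}
    for u, v, c, r in edges:
        fwd.setdefault(u, []).append((v, r))
        bwd.setdefault(v, []).append((u, r))   # reversed graph
    from_s = min_resource(fwd, s)
    to_t = min_resource(bwd, t)
    inf = float("inf")
    nodes = {v for v in set(from_s) | set(to_t)
             if from_s.get(v, inf) + to_t.get(v, inf) <= limit}
    kept = [(u, v, c, r) for u, v, c, r in edges
            if from_s.get(u, inf) + r + to_t.get(v, inf) <= limit]
    return nodes, kept

# Hypothetical instance: every path through node 1 uses 20 resource units.
edges = [(0, 1, 1, 10), (1, 3, 1, 10), (0, 2, 9, 2), (2, 3, 9, 2), (0, 3, 15, 8)]
nodes, kept = reduce_by_resource(edges, 0, 3, limit=10)
```

With limit 10, node 1 and both of its edges are removed, since every path through node 1 consumes 20 units; the remaining graph has three nodes and three edges.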

So a user is able to experiment with all the known “state of the art” methods, 
trying out arbitrary combinations to see which setting fits best for a particular 
application. 

The general function looks like this (reductions are turned on, and the
k-shortest path algorithm is used by default):

csp(G, s, t, cost, resource, upperbound);

Default parameters enable the user to turn on/off reductions and relaxation 
computation and switch between different methods (or provide new ones). 



3.2 Special Case: Constrained Minimum Spanning Trees 



The constrained minimum spanning tree problem is to find a spanning tree of 
minimal cost that satisfies the resource limit(s). 

Ahuja et al. [AMO93] first reported that the Lagrangean relaxation can be
solved with a subgradient procedure. Xue [Xue00] gave a binary search procedure
for solving the Lagrangean relaxation in the single resource case. Using our
hull approach we may solve the relaxation exactly also for multiple resources;
moreover, in the single resource case for integral costs and resources we get a
polynomial running time for solving the relaxation.

We also offer a special function for the constrained minimum spanning tree 
problem. To the best of our knowledge, this is the first implementation for the 
constrained minimum spanning tree problem. 

As core algorithm for the relaxation methods in our generic approach we use 
LEDA’s implementation of Kruskal’s minimum spanning tree algorithm. A user 
again may provide an own implementation. 

For the gap closing step we implemented the spanning tree ranking algorithm
of Katoh, Ibaraki and Mine [KIM81].⁴ Other methods may be provided by the
user.

Like in the constrained shortest path case there are problem reductions pos- 
sible. We used ideas of Eppstein |Epp90| to also provide functions for problem 
reductions in the constrained spanning tree case. They can be turned on or off 
by the user. 

The general function looks like this (reductions are turned on, and the
spanning tree ranking algorithm is used by default):

cmst(G, s, t, cost, resource, upperbound);

Default parameters enable the user to turn on/off reductions and relaxation 
computation and switch between different methods (or provide new ones). 



⁴ We will also offer the improvement of Eppstein [Epp90] in the near future.
4 Applications

We now discuss four applications demonstrating the flexibility and the use of 
our package. At the end of this section we report some running times. 

4.1 Route Planning 

Route planning is a standard application of constrained shortest paths. We are
given a road or train network and a source and target destination. We want to
travel from source to target destination with minimal cost satisfying the resource
constraint(s). Here, costs could be for example time and resource could be fuel
consumption (see left part of Figure 3). Many other cost and resource models
exist. For example we may want to minimize the total height difference while
not travelling more than a given distance (see right part of Figure 3). After
setting up the graph and the edge costs and resources we can simply use the csp
function of CNOP.

4.2 Curve Approximation 

Piecewise linear curves (or polygons) are often used to approximate complex 
functions or geometric objects in areas like computer aided geometric design, 
cartography, computer graphics, image processing and mathematical program- 
ming. Often, these curves have a very large number of breakpoints, and for 
storage space or other reasons, one may be interested in an approximation of 
the curve using fewer breakpoints. 




Fig. 3. Minimum time satisfying fuel constraints (congestion areas are shaded) (left) 
Minimum total height difference satisfying length constraint (right). The minimum cost 
paths are yellow, the minimum resource paths brown and the constrained shortest paths 
green. 
Dahl and Realfsen, and Nygaard et al. [DR97, NHH98, DR00] studied this
problem and showed how to model this as a constrained shortest path problem:
The breakpoints $v_1, \ldots, v_n$ are the nodes of the graph and for all
$1 \le i < j \le n$ we introduce an edge $(v_i, v_j)$. The cost of an edge is set to the
approximation error introduced by taking this shortcut instead of the original
curve⁵. The resource of an edge is set to 1. Now we may compute the best
approximation using at most k breakpoints with our general approach (see Figure 4
for an example). Alternatively, we may also compute the minimum number of
breakpoints for a given approximation error.




Fig. 4. Coastline of Corsica (800 points) and minimum error approximation using 200
points



Since we now have identical resources and $k \le n$, the problem is not NP-hard
anymore, since the dynamic programming formulation now runs in $O(kn^2) =
O(n^3)$.
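
The dynamic programming formulation for unit resources can be sketched as follows. The sketch makes several assumptions that are not fixed by the text: the error of a shortcut is taken to be the maximum perpendicular distance of the skipped points to the shortcut segment, the budget k counts segments (edges of the path) rather than breakpoints, consecutive chosen points are distinct, and the names `segment_error` and `approximate` are invented for this example.

```python
import math

def segment_error(pts, i, j):
    """Assumed error metric: largest distance of pts[i+1..j-1] to the line
    through pts[i] and pts[j]."""
    (x1, y1), (x2, y2) = pts[i], pts[j]
    dx, dy = x2 - x1, y2 - y1
    length = math.hypot(dx, dy)
    worst = 0.0
    for px, py in pts[i + 1:j]:
        worst = max(worst, abs(dx * (py - y1) - dy * (px - x1)) / length)
    return worst

def approximate(pts, k):
    """Minimum total error polyline from pts[0] to pts[-1] with at most k segments.

    Returns (error, indices of the chosen points).
    """
    n = len(pts)
    inf = float("inf")
    err = {(i, j): segment_error(pts, i, j)
           for j in range(1, n) for i in range(j)}
    # best[h][j]: minimum error of a path from point 0 to point j with h edges
    best = [[inf] * n for _ in range(k + 1)]
    pred = [[None] * n for _ in range(k + 1)]
    best[0][0] = 0.0
    for h in range(1, k + 1):          # O(k n^2) table updates
        for j in range(1, n):
            for i in range(j):
                cand = best[h - 1][i] + err[(i, j)]
                if cand < best[h][j]:
                    best[h][j], pred[h][j] = cand, i
    h = min(range(1, k + 1), key=lambda q: best[q][n - 1])
    total, j, idx = best[h][n - 1], n - 1, [n - 1]
    while h > 0:
        j = pred[h][j]
        idx.append(j)
        h -= 1
    idx.reverse()
    return total, idx

# Hypothetical curve made of two straight pieces joined at index 2.
pts = [(0, 0), (1, 0), (2, 0), (3, 1), (4, 2)]
err2, idx2 = approximate(pts, 2)
err1, idx1 = approximate(pts, 1)
```

With two segments the curve is reproduced exactly through the points with indices 0, 2, 4. With a single segment the error equals the distance of the bend at index 2 to the chord, 2/√5.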

Dahl and Realfsen [DR97, DR00] observed that solving the relaxation
outperforms the exact dynamic programming but pointed out the problem that it
cannot guarantee optimality, although the optimum was often reached in their
experiments. Nygaard et al. [NHH98], in contrast, consider the dynamic
programming method as state of the art.

Using our package, we can confirm the experiments of [DR97, DR00] for a larger
number of breakpoints, comparing the 2-phase approach with dynamic
programming (see also Table 3). Moreover, the gap closing time is usually dominated by
solving the relaxation.

There are numerous other applications of the constrained shortest path prob- 
lem: 



⁵ Different error metrics may be used.
Modeling Engineering Problems as Constrained Shortest Paths. Elimam and
Kohler [EK97] show how to model two engineering applications as constrained
shortest path problems: optimal sequences for the treatment of wastewater
processes and minimum cost energy-efficient composite wall and roof structures.

Constrained Shortest Paths as a Subproblem. There are several applications in 
operations research that involve constrained shortest paths as a subproblem. 
Most of the time these are column generation methods. Examples are duty 
scheduling PEKLOOj . traffic routing with congestion j.TMS99] and scheduling 
switching engines 

Constrained Geodesic Shortest Paths in TINs. Given a triangulated irregular
network (TIN) describing a terrain surface, we also may ask for the continuous
(geodesic) version of the constrained shortest path problem. Now the length of a
path link is its geodesic length and the resource of a path link could for example
be its slope. Now we have geodesic shortest paths as a core problem. We
reimplemented the approximation scheme of Lanthier, Maheshwari and Sack [LMS00].
Plugging this core algorithm into our hull approach we get an approximate
solution to the optimal relaxation [MSZ00].

4.3 Minimum Cost Reliability Constrained Spanning Trees 

The previous applications all had shortest paths as a core routine. Now we
turn to a different constrained network problem, the minimum cost reliability
constrained spanning tree problem that arises in communication networks: We
are given a set of n stations in the plane that can communicate with each other.
We now want to connect the stations; the cost of a connection might be modeled
by the distance of the stations and the reliability of a connection by its fault
probability. We now want to compute a minimum cost connection (spanning
tree) such that its total fault probability does not exceed a given limit.

Fig. 5. Minimum cost spanning tree and minimum cost reliability constrained
spanning tree. Width of edges corresponds to fault probability.

After setting up the graph with its costs and resources we can simply use the
cmst function of CNOP. Figure 5 gives an example.



4.4 Table Layout 

Consider the problem of laying out a two-dimensional table. Each cell of the table
has content associated with it and we have choices on the geometry of cells. The
table layout problem is to choose configurations for the cells to minimize the
table height given a fixed width for the table (see Tables 1 and 2 that display
excerpts of the book bestseller list in Germany). The problem is motivated by
typography, both traditional print-to-paper and online web typography.

Anderson and Sobti [AS99] studied this problem and showed how to model
it as a flow problem in a graph. Solving scaled minimum cut computations as in
the hull approach they solve a relaxation.

We reimplemented their graph modeling and apply our general hull scheme 
using the max flow-minimum cut algorithm of LEDA. After setting up the graph 
and the special minimum cut function we may use our package. 

The hull approach gives us the optimal solution of the relaxation and a 
feasible solution providing an upper bound for the table layout problem. If we 
want to have the optimal solution we have to rank the cuts in increasing reduced 
cost order. This can be done with the algorithm of Hamacher et al. FTOl . 
We plan to include an implementation of this algorithm in the future. 



Table 1. Minimum height table without width constraints.

Belletristik                              Sachbücher

Joanne Rowling: Harry Potter              Marcel Reich-Ranicki: Mein Leben
Rosamunde Pilcher: Wintersonne            Lance Armstrong: Tour des Lebens
Donna Leon: In Sachen Signora Brunetti    Florian Illies: Generation Golf


Table 2. Minimum height table with total width ≤ 29 characters.

Belletristik      Sachbücher

Joanne            Marcel Reich-
Rowling:          Ranicki: Mein
Harry Potter      Leben

Rosamunde         Lance
Pilcher:          Armstrong: Tour
Wintersonne       des Lebens

Donna Leon: In    Florian Illies:
Sachen Signora    Generation Golf
Brunetti



K. Mehlhorn and M. Ziegelmann 



The symmetric problem of minimizing the table’s width for a given height is 
also easily solved using our package. Our package would also allow to solve a 3d 
table layout problem where each cell also has a length, i.e. minimize height of 
the table given limits on the width and length of the table. 



Experiments 

At the end of this section we want to give an idea about the running time of
the different constrained shortest path methods provided, on the applications
discussed. Since the running time depends on the size and structure of the
graph, the distribution of the edge costs and resources and the “hardness” of
the constraint(s), we cannot give a detailed running time comparison of different
methods. However, Table 3 gives some idea of the efficiency of different
constrained shortest path methods on various kinds of graphs.⁶ We see that the
2-phase method (hull approach + gap closing) is superior to naive approaches
like ILP solving or dynamic programming. It is also faster than the labeling
approach, especially for larger graphs, except in the case of road graphs where it
is slightly worse than the labeling. Moreover, the running time of the 2-phase
method does not seem to depend too much on the structure of the graph but just
on the size of the graph. A user can easily do a detailed running time comparison
for a specific problem using the package.



5 Summary 

We proposed a package for constrained network optimization problems. Even 
though there exist several implementations of different algorithms for the con- 
strained shortest path problem, this is the first publicly available package that 
provides implementations of state of the art algorithms and allows flexible ex- 
change of different methods. 

The package also provides the first available implementation for computing 
constrained minimum spanning trees. 

The general framework can easily be adapted to solve other constrained net- 
work optimization problems as we have seen in the section about the table layout 
problem. 

We believe that the flexible design of the package makes it very easy to use. 
Moreover, it is customizable to the user’s needs. Due to the practical importance 
of the provided methods we believe that it will be useful - at least for experi- 
mental purposes - for a large number of people working in operations research. 

Future work includes implementing some other ranking algorithms and thus 
providing complete optimization code for other examples. We also plan to offer 
the package as a LEDA extension package (LEP) and support other platforms 
and compilers. 

⁶ All experiments measure CPU time in seconds on a Sun Enterprise 333MHz, 6Gb
RAM running SunOS 5.7 using LEDA 4.3 and CNOP compiled with gcc-2.95.2 -O.
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Table 3. Running time for different constrained shortest path methods on different 
graph structures. The first column describes the graph type. DEM stands for digital 
elevation models (see Figure El (right)), i.e. bidirected grid graphs with cost of an edge 
equal to the height difference of its nodes and resource of an edge randomly chosen 
from [10,20]. ROAD stands for US road graphs (see Figure 0 (left)) with the cost of an 
edge equal to the congestion time and the resource of an edge equal to its euclidean 
length. CURVE stands for graphs obtained from approximating a random linear signal 
(see Subsection (but here we limit the outdegree of the nodes to 10)) with cost of 
an edge being the approximation error and resource of an edge equal to 1. Columns 
2 and 3 contain the number of nodes and edges of the graph. Column 4 contains the 
resource limit (which is set to 110% of the minimum resource path in the graph in 
the DEM and ROAD case) and Column 5 contains the optimum. The following three 
columns contain upper and lower bound obtained by the hull approach and the number 
of iterations. The last four columns contain the running time in seconds for the 2-phase 
approach, label correcting, dynamic programming and ILP solving. A “-” means that
computation was aborted after 5 minutes.



type        n       m       λ       OPT        UB        LB  it  2-phase  label    DP    ILP
DEM     11648   46160  2579.5     18427     18509   18390.5   8      1.6    4.2     -      -
DEM     41600  165584  4650.8     27863     27913   27836.6  10      8.7   40.3     -      -
DEM     16384   65024  2810.5      3669      4381    3626.9   7      2.3    5.5     -      -
DEM     63360  252432  5897.1      5695      6045    5676.6   8     11.7   85.9     -      -
ROAD    24086   50826   816.9      3620      3620    3513.3   6      1.9    1.0     -      -
ROAD    77059  171536  1324.9       400       400     392.8   6      8.2    7.1     -      -
CURVE    1000    9945     300   5977.23   5977.23   5977.23   9     0.27    1.9   2.6    8.1
CURVE    5000   49945     800   72097.7     72220     72097  12      1.6   14.7  52.3  150.4
CURVE   10000   99945    2000    107208    107208    107208  13      3.3  103.4     -      -

The CNOP package including manual and applications can be freely downloaded
for non-commercial use from the URL

http://www.mpi-sb.mpg.de/~mark/cnop
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Abstract. The purpose of this paper is to provide a preliminary report 
on the first broad-based experimental comparison of modern heuristics 
for the asymmetric traveling salesman problem (ATSP). There are currently
three general classes of such heuristics: classical tour construction
heuristics such as Nearest Neighbor and the Greedy algorithm, local 
search algorithms based on re-arranging segments of the tour, as exempli- 
fied by the Kanellakis-Papadimitriou algorithm [KP80], and algorithms 
based on patching together the cycles in a minimum cycle cover, the 
best of which are variants on an algorithm proposed by Zhang [Zha93]. 

We test implementations of the main contenders from each class on a 
variety of instance types, introducing a variety of new random instance 
generators modeled on real-world applications of the ATSP. Among the 
many tentative conclusions we reach is that no single algorithm is dom- 
inant over all instance classes, although for each class the best tours are 
found either by Zhang’s algorithm or an iterated variant on Kanellakis- 
Papadimitriou. 

1 Introduction 

In the traveling salesman problem, one is given a set of $N$ cities and for each
pair of cities $c_i, c_j$ a distance $d(c_i, c_j)$. The goal is to find a permutation $\pi$ of the
cities that minimizes

    $\sum_{i=1}^{N-1} d(c_{\pi(i)}, c_{\pi(i+1)}) + d(c_{\pi(N)}, c_{\pi(1)})$.

Most of the research on heuristics on this problem has concentrated on the
symmetric case (the STSP), where $d(c, c') = d(c', c)$ for all pairs of cities $c, c'$.
Surveys such as [...] experimentally examine the performance of a wide
variety of heuristics on reasonably wide sets of instances, and many papers study 
individual heuristics in more detail. For the general not-necessarily-symmetric 
case, typically referred to as the “asymmetric TSP” (ATSP), there are far fewer 
publications, and none that comprehensively cover the current best approaches. 
This is unfortunate, as a wide variety of ATSP applications arise in practice. In 
this paper we attempt to begin a more comprehensive study of the ATSP. 

The few previous studies of the ATSP that have made an attempt at covering
multiple algorithms [...] have had several drawbacks. First, the classes of test
instances studied have not been well-motivated
in comparison to those studied in the case of the symmetric TSP. For the latter
problem the standard testbeds are instances from TSPLIB [Rei91] and
randomly-generated two-dimensional point sets, and algorithmic performance on
these instances seems to correlate well with behavior in practice. For the ATSP, TSPLIB
offers fewer and smaller instances with less variety, and the most commonly stud- 
ied instance classes are random distance matrices (asymmetric and symmetric) 
and other classes with no plausible connection to real applications of the ATSP. 

For this study, we have constructed seven new instance generators based on a 
diverse set of potential ATSP applications, and we have tested a comprehensive 
set of heuristics on classes produced using these generators as well as the tradi- 
tional random distance matrix generators. We also have tested the algorithms 
on the ATSP instances of TSPLIB and our own growing collection of instances 
from actual real-world applications. 

We have attempted to get some insight into how algorithmic performance
scales with instance size, at least within instance classes. To do this, we have
generated test suites of ten 100-city instances, ten 316-city instances, three
1000-city instances, and one 3162-city instance for each class, the numbers of
cities going up by a factor of approximately $\sqrt{10}$ at each step. At present,
comprehensive study of larger instances is difficult, given that the common
interface between our generators and the algorithms is a file of all $N^2$
inter-city distances, which would be over 450 megabytes for a 10,000-city
instance. Fortunately, trends are already apparent by the time we reach 3162 cities.

We also improve on previous studies in the breadth of the algorithms that
we cover. Current ATSP heuristics can be divided into three classes: (1) classical
tour construction heuristics such as Nearest Neighbor and the Greedy algorithm,
(2) local search algorithms based on re-arranging segments of the tour, as
exemplified by the Kanellakis-Papadimitriou algorithm [KP80], and (3) algorithms
based on patching together the cycles in a minimum cycle cover (which can be
computed as the solution to an Assignment Problem, i.e., by constructing a
minimum weight perfect bipartite matching). Examples of this last class include the
algorithms proposed in [...] and the $O(\log N)$ worst-case ratio “Repeated
Assignment” algorithm of [FGM82]. We cover all three classes, including
recent improvements on the best algorithms in the last two.
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In the case of local search algorithms, previous studies have not considered 
implementations that incorporate such recent innovations from the symmetric 
TSP world as “don’t-look bits” and chained (iterated) local search. Nor have
they made use of the recent observation of [Glo96] that the best “2-bridge”
4-opt move can be found in time $O(N^2)$ rather than the naive $O(N^4)$. Together,
these two ideas can significantly shift the competitive balance.

As to cycle-patching algorithms, these now seem to be dominated by the pre- 
viously little-noted “truncated depth-first branch-and-bound” heuristic of Zhang 
[Zha93]. We present the first independent confirmation of the surprising results
claimed in that paper, by testing both Zhang’s own implementation and a new 
one independently produced by our first author. 

A final improvement over previous ATSP studies is our use of the Held-Karp
(HK) lower bound on optimal tour length [HK70, HK71, JMR96] as our standard
of comparison. Currently all we know theoretically is that this bound is
within a factor of log N of the optimal tour length when the triangle inequality
holds [Wil92]. In practice, however, this bound tends to lie within 1 or 2% of the
optimal tour length whether or not the triangle inequality holds, as was observed
for the symmetric case in [JMR96] and as this paper’s results suggest is true for
the ATSP as well. By contrast, the Assignment Problem lower bound (AP) is
often 15% or more below the Held-Karp bound (and hence at least that far
below the optimal tour length). The Held-Karp bound is also a significantly more
reproducible and meaningful standard of comparison than “the best tour length
so far seen,” the standard used for example in the (otherwise well-done) study
of [Rep94]. We computed Held-Karp bounds for our instances by performing the
NP-completeness transformation from the ATSP to the STSP and then applying
the publicly available Concorde code of [ABCC98], which has options for
computing the Held-Karp bound as well as the optimal solution. Where feasible,
we also applied Concorde in this way to determine the optimal tour length.

The remainder of this paper is organized as follows. In Section 2 we provide 
high-level descriptions of the algorithms whose implementations we study, with 
citations to the literature for more detail where appropriate. In Section 3 we 
describe and motivate our instance generators and the classes we generate using 
them. In Section 4, we summarize the results of our experiments and list some
of our tentative conclusions. This is a preliminary report on an ongoing study. A
final section (Section 5) describes some of the additional questions we intend to 
address in the process of preparing the final journal version of the paper. 



2 Algorithms 

In this section we briefly describe the algorithms covered in this study and our 
implementations of them. Unless otherwise stated, the implementations are in 
C, but several were done in C++, which may have added to their running time 
overhead. In general, running time differences of a factor of two or less should not 
be considered significant unless the codes come from the same implementation 
family. 
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2.1 Tour Construction: Nearest Neighbor and Greedy 

We study randomized ATSP variants of the classical Nearest Neighbor (NN) and 
Greedy (GR) algorithms. In the former one starts from a random city, and then 
successively goes to the nearest as-yet-unvisited city. In our implementation of 
the latter, we view the instance as a complete directed graph with edge lengths 
equal to the corresponding inter-city distances. Sort the edges in order of increas- 
ing length. Call an edge eligible if it can be added to the current set of chosen 
edges without creating a (non-Hamiltonian) cycle or causing an in- or out-degree 
to exceed one. Our algorithm works by repeatedly choosing (randomly) one of
the two shortest eligible edges until a tour is constructed. Randomization is
important if these are to be used for generating starting tours for local optimization
algorithms, since a common way of using the latter is to perform a set of runs
and take the best.
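
The two construction heuristics above can be sketched as follows. This is a minimal illustration, not the authors' C implementation: the function names, the successor/predecessor bookkeeping, and the repeated filtering of the sorted edge list (cubic time overall) are assumptions made for brevity, and the neighbor lists mentioned below are not used.

```python
import random


def tour_length(dist, tour):
    """Length of the closed tour that visits the cities in the given order."""
    n = len(tour)
    return sum(dist[tour[k]][tour[(k + 1) % n]] for k in range(n))


def nearest_neighbor_tour(dist, rng):
    """Start at a random city, then always move to the nearest unvisited city."""
    n = len(dist)
    tour = [rng.randrange(n)]
    unvisited = set(range(n)) - {tour[0]}
    while unvisited:
        here = tour[-1]
        nxt = min(unvisited, key=lambda c: (dist[here][c], c))
        unvisited.remove(nxt)
        tour.append(nxt)
    return tour


def greedy_tour(dist, rng):
    """Repeatedly pick, at random, one of the two shortest eligible edges."""
    n = len(dist)
    edges = sorted((dist[i][j], i, j) for i in range(n) for j in range(n) if i != j)
    succ = [None] * n            # chosen out-edge of each city
    has_pred = [False] * n       # whether a chosen edge enters the city
    start_of = list(range(n))    # start_of[e]: first city of the path ending at e
    end_of = list(range(n))      # end_of[s]: last city of the path starting at s
    for _ in range(n - 1):
        # Keep only edges that respect the degree bounds and do not close a
        # premature cycle. An edge failing this test cannot become eligible
        # again before the final closing edge, so it is dropped for good.
        edges = [(d, i, j) for d, i, j in edges
                 if succ[i] is None and not has_pred[j] and start_of[i] != j]
        _, i, j = rng.choice(edges[:2])
        succ[i], has_pred[j] = j, True
        s, e = start_of[i], end_of[j]
        start_of[e], end_of[s] = s, e
    # The n - 1 chosen edges form one Hamiltonian path; the closing edge is implied.
    city = has_pred.index(False)
    tour = [city]
    while succ[city] is not None:
        city = succ[city]
        tour.append(city)
    return tour
```

Passing differently seeded `random.Random` objects gives the set of distinct starting tours that a multi-run local search would use.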

Running times for these algorithms may be inflated from what they would 
be for stand-alone codes, as we are timing the implementations used in our 
local search code. These start by constructing ordered lists of the 20 nearest 
neighbors to and from each city for subsequent use by the local search phase of 
the algorithms. The tour construction code exploits these lists where possible, 
but the total time spent constructing the lists may exceed the benefit obtained.
For NN in particular, where we have an independent implementation, we appear 
to be paying roughly a 50% penalty in increased running time. 



2.2 Patched Cycle Cover 

We implement the patching procedure described in [KS85]. We first compute a
minimum cycle cover using a weighted bipartite matching (assignment problem) 
code. We then repeatedly select the two biggest cycles and combine them into the 
shortest overall cycle that can be constructed by breaking one edge in each cycle 
and patching together the two resulting directed paths. Despite the potentially 
superquadratic time for this patching procedure, in practice the running times 
for this algorithm are dominated by that for constructing the initial cycle cover 
(a potentially cubic computation). This procedure provides better results than 
repeatedly patching together the two shortest cycles. 
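
The patching step can be sketched as below, assuming the minimum cycle cover has already been computed and is given as a successor array `succ` (the assignment-problem solver itself is omitted). The names `cycles_of` and `patch` are illustrative only.

```python
def cycles_of(succ):
    """Split a successor array (a permutation) into its cycles."""
    seen, cycles = set(), []
    for start in range(len(succ)):
        if start not in seen:
            cycle, city = [], start
            while city not in seen:
                seen.add(city)
                cycle.append(city)
                city = succ[city]
            cycles.append(cycle)
    return cycles


def patch(dist, succ):
    """Merge the two biggest cycles at least cost until one tour remains."""
    succ = list(succ)
    while True:
        cycles = sorted(cycles_of(succ), key=len, reverse=True)
        if len(cycles) == 1:
            return succ
        big, second = cycles[0], cycles[1]
        # Breaking edges (a, succ[a]) and (b, succ[b]) and adding
        # (a, succ[b]) and (b, succ[a]) joins the two cycles into one.
        _, a, b = min(
            (dist[a][succ[b]] + dist[b][succ[a]]
             - dist[a][succ[a]] - dist[b][succ[b]], a, b)
            for a in big for b in second)
        succ[a], succ[b] = succ[b], succ[a]
```

Each merge examines every pair of edges from the two cycles, which is the source of the potentially superquadratic total time noted above.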

A variety of alternative patching procedures are studied in [...],
several of which, such as the “Contract-or-Patch” heuristic of [...],
obtain better results at the cost of greater running time. The increase in running 
time is typically less than that required by Zhang’s algorithm (to be described 
below). On the other hand, Zhang’s algorithm appears to provide tour quality 
that is always at least as good and often significantly better than that reported 
for any of these alternative patching procedures. In this preliminary report we 
wish to concentrate on the best practical heuristics rather than the fastest, and 
so have not yet added any of the alternative patching procedures to our test 
suite. The simple PATCH heuristic provides an interesting data point by itself, as 
its tour is the starting point for Zhang’s algorithm. 






J. Cirasella et al. 



2.3 Repeated Assignment 

This algorithm was originally studied in [FGM82] and currently has the best
proven worst-case performance guarantee of any polynomial-time ATSP heuristic
(assuming the triangle inequality holds): Its tour lengths are at most log N
times the optimal tour length. This is not impressive in comparison to the 3/2
guarantee for the Christofides heuristic for the symmetric case, but nothing
better has been found in two decades.

The algorithm works by constructing a minimum cycle cover and then re- 
peating the following until a connected graph is obtained. For each connected 
component of the current graph, select a representative vertex. Then compute 
a minimum cycle cover for the subgraph induced by these chosen vertices and 
add that to the current graph. A connected graph will be obtained before one 
has performed more than log N matchings, and each matching can be no longer 
than the optimal ATSP tour which is itself a cycle cover. Thus the total edge 
length for the connected graph is at most log N times the length of the optimal 
tour. Note also that it must be strongly connected and all vertices must have 
in-degree equal to out-degree. Thus it is Eulerian, and if one constructs an Euler 
tour and follows it, shortcutting past any vertex previously encountered, one 
obtains an ATSP tour that by the triangle inequality can be no longer than the 
total edge length of the graph. 
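
The final step, turning the connected Eulerian graph into a tour, can be sketched as follows. The sketch assumes the union of the cycle covers is given as a list of directed edges and uses Hierholzer's algorithm for the Euler tour, which is a standard choice and not necessarily what the implementations below use.

```python
def euler_circuit(n, edges, start=0):
    """Euler circuit of a directed multigraph assumed to be Eulerian."""
    out = [[] for _ in range(n)]
    for u, v in edges:
        out[u].append(v)
    stack, circuit = [start], []
    while stack:
        u = stack[-1]
        if out[u]:
            stack.append(out[u].pop())   # follow an unused edge
        else:
            circuit.append(stack.pop())  # dead end: emit the vertex
    circuit.reverse()
    return circuit


def shortcut(circuit):
    """Skip every vertex that has already been visited."""
    seen, tour = set(), []
    for v in circuit:
        if v not in seen:
            seen.add(v)
            tour.append(v)
    return tour
```

For example, the cycle cover {0→1→0, 2→3→2} plus a second cover 0→2→0 on the representatives 0 and 2 yields a six-edge Euler circuit, and `shortcut` reduces it to a tour on the four cities.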

We have implemented two variants on this, both in C++. In the first (RA) one
simply picks the component representatives randomly and converts the Euler
tour into an ATSP tour as described. In the second (RA+) we use heuristics to find
good choices of representatives and rather than following the Euler tour, we use
a greedy heuristic to pick, for each vertex in turn that has in- and out-degree
exceeding 1, the best ways to short-cut the Euler tour so as to reduce these
degrees to 1. This latter approach improves the tours found by Christofides in
the STSP by as much as 5% [JBMR]. The combination of the two heuristics
here can provide even bigger improvements, depending on the instance class,
even though RA+ takes no more time than RA. Unfortunately, RA is often so bad
that even the substantially improved results of RA+ are not competitive with
those for PATCH. To save space we shall report results for RA+ only.



2.4 Zhang’s Algorithm and Its Variants 

As described in [Zha93], Zhang’s algorithm is built by truncating the computations
of an AP-based branch-and-bound code that used depth-first search as its
exploration strategy. One starts by computing a minimum length cycle cover $M_0$
and determining an initial champion tour by patching as in PATCH. If this tour
is no longer than $M_0$ (for instance if $M_0$ was itself a tour), we halt and return
it. Otherwise, call $M_0$ the initial incumbent cycle cover, and let $I_0$, the set of
included edges, and $X_0$, the set of excluded edges, be initially empty. The variant
we study in this paper proceeds as follows.

Inductively, the incumbent cycle cover $M_i$ is the minimum length cycle cover
that contains all edges from $I_i$ and none of the edges from $X_i$, and we assume
that $M_i$ is shorter than the current champion tour and is not itself a tour. Let
$C = \{e_1, e_2, \ldots, e_k\}$ be a cycle (viewed as a set of edges) of minimum size in $M_i$.
As pointed out in [...], there are $k$ distinct ways of breaking this cycle: One
can force the deletion of $e_1$, retain $e_1$ and force the deletion of $e_2$, retain $e_1$ and
$e_2$ and force the deletion of $e_3$, etc. We solve a new matching problem for each
of these possibilities that is not forbidden by the requirement that all edges in $I_i$
be included and all edges in $X_i$ be excluded. In particular, for all $h$, $1 \le h \le k$,
such that $e_h$ is not in $I_i$ and $X_i \cap \{e_j : 1 \le j < h\} = \emptyset$, we construct a minimum
cycle cover that includes all the edges in $I_i \cup \{e_j : 1 \le j < h\}$ and includes none
of the edges in $X_i \cup \{e_h\}$. (The exclusion of the edges in this latter set is forced
by adjusting their lengths to a value exceeding the initial champion tour length,
thus preventing their use in any subsequent viable child.) If one retains the data
structures used in the construction of $M_i$, each new minimum cycle cover can be
computed using only one augmenting path computation.
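
The branching rule can be made concrete with a short sketch that only enumerates the admissible (included, excluded) edge sets of the children; solving the resulting constrained assignment problems is left out. The function name and the representation of edges as `(tail, head)` pairs are assumptions for illustration.

```python
def branch_on_cycle(cycle_edges, included, excluded):
    """Return the (included, excluded) sets of every admissible child.

    cycle_edges is the list e_1, ..., e_k of the chosen cycle; child h keeps
    e_1, ..., e_{h-1} and forbids e_h.
    """
    children = []
    for h, e_h in enumerate(cycle_edges):
        kept = cycle_edges[:h]
        if e_h in included:
            continue  # an edge that must be included cannot be deleted
        if any(e in excluded for e in kept):
            continue  # an edge that must be excluded cannot be retained
        children.append((included | set(kept), excluded | {e_h}))
    return children
```

With empty constraint sets a cycle of k edges produces exactly k children, matching the k ways of breaking the cycle described above.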

Let us call the resulting cycle covers the children of Mi. Call a child viable if 
its length is less than the current champion tour. If any of the viable children is 
a tour and is better than the current champion, we replace the champion by the 
best of these (which in turn will cause the set of viable children to shrink, since 
now none of the children that are tours will be viable). If at this point there 
is no viable child, we halt and return the best tour seen so far. Otherwise, let 
the new incumbent $M_{i+1}$ be a viable child of minimum length. Patch $M_{i+1}$ and
if the result is better than the current champion tour, update the latter. Then
set $I_{i+1}$ and $X_{i+1}$ to reflect the sets of included and excluded edges specified
in the construction of $M_{i+1}$ and continue. This process must terminate after at
most $N^2$ phases, since each phase adds at least one new edge to $I_i \cup X_i$, and so
we must eventually either construct a tour or obtain a graph in which no cycle
cover is shorter than the initial champion tour.

We shall refer to Zhang’s C implementation of this algorithm as ZHANG1.
We have also studied several variants. The two most significant are ZHANG2, in
which in each phase all viable children are patched to tours to see if a new
champion can be produced, and ZHANG0, in which patching is only performed on
$M_0$. ZHANG2 produces marginally better tours than does ZHANG1, but at a typical
cost of roughly doubling the running time. ZHANG0 is only slightly faster than
ZHANG1 and produces significantly worse tours. Because of space restrictions we
postpone details on these and other variants of ZHANG1 to the final report.

That final report will also contain results for an independent implementation
of a variant of ZHANG1 by Cirasella, which we shall call ZHANG1-C. This
code differs from ZHANG1 in that it is implemented in C++, uses different
tie-breaking rules, and departs from the description of ZHANG1 in one detail: only
the shortest viable child is checked for tour-hood. All things being equal, this
last change cannot improve the end result even though it may lead to deeper
searches. However, because of differences in tie-breaking rules in the rest of the
code, ZHANG1-C often does find better tours than ZHANG1, roughly about as
often as it finds worse ones. Thus future implementers should be able to obtain
similar quality tours so long as they follow the algorithm description given above
and break ties as they see fit. If running time is an issue, however, care should
be exercised in the implementation of the algorithm to solve the AP. Because
it uses a differently implemented algorithm for solving the assignment problem,
ZHANG1-C is unfortunately a factor of 2 to 3 times slower than ZHANG1.

Note: All our implementations differ significantly from the algorithm called
“truncated branch-and-bound” and studied in [Zha93]. The latter algorithm is
allowed to backtrack if the current matching has no viable children, and will 
keep running until it encounters a viable matching that is a tour or runs out 
of branch-and-bound tree nodes to explore. For some of our test instances, this 
latter process can degenerate into almost a full search of the branch and bound 
tree, which makes this approach unsuitable for use as a “fast” heuristic. 

2.5 Local Search: 3-Opt 

Our local search algorithms work by first constructing a Nearest Neighbor tour,
and then trying to improve it by various forms of tour rearrangement. In 3-Opt,
the rearrangement we consider breaks the tour into three segments $S_1, S_2, S_3$ by
deleting three edges, and then reorders the segments as $S_2, S_1, S_3$ and relinks
them to form a new tour. If the new tour is shorter, it becomes our new current
tour. This is continued until no such improvement can be found. Our
implementation follows the schema described for the STSP in [...] in sequentially
choosing the endpoints of the edges that will be broken. We also construct
near-neighbor lists to speed (and possibly limit) the search, and exploit “don’t-look”
bits to avoid repeating searches that are unlikely to be successful. Because of
space limitations, this preliminary report will not report on results for 3-Opt
itself, although it does cover the much more effective “iterated” algorithm based
on 3-Opt, to be described below.

2.6 Local Search: Kanellakis-Papadimitriou Variants 

Note that local search algorithms that hope to do well for arbitrary ATSP
instances cannot reverse tour segments (as is done in many STSP heuristics),
only reorder them. Currently the ultimate “segment-reordering” algorithm is the
Kanellakis-Papadimitriou algorithm [KP80], which attempts to the extent
possible to mimic the Lin-Kernighan algorithm for the STSP [LK73, JBMR, JM97].
It consists of two alternating search processes.

The first process is a variable-depth search that tries to find an improving
$k$-opt move for some odd $k \ge 3$ by a constrained sequential search procedure
modeled on that in Lin-Kernighan. As opposed to that algorithm, however, [KP80]
requires that each of the odd $h$-opt moves encountered along the way must have a
better “gain” than its predecessor. (In the Lin-Kernighan algorithm, the partial
moves must all have positive gain, but gain is allowed to temporarily decrease
in hopes that eventually something better will be found.)

The second process is a search for an improving 4-Opt move, for which 
K&P used a potentially C(V^) algorithm which seemed to work well in prac- 
tice. The original paper on Lin-Kernighan for the STSP also suggested 
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finding 4-Opt moves of this sort as an augmentation to the sequential search 
process (which was structurally unable to find them), but concluded that they 
were not worth the added computation time incurred. In our implementation of
Kanellakis-Papadimitriou, we actually find the best 4-Opt move in time $O(N^2)$
using a dynamic programming approach suggested by [Glo96]. This makes the
use of 4-Opt moves much more cost-effective; indeed they are necessary if one is
to get the best tours for a given investment of running time using a
Kanellakis-Papadimitriou variant.

Our implementation also strengthens the sequential search portion of the 
original K&P: We use neighbor lists and don’t-look bits to speed the sequential 
search. We also take the Lin-Kernighan approach and allow temporary decreases 
in the net gain, and have what we believe are two improved versions of the search 
that goes from an $h$-opt move to an $(h+2)$-opt move, one designed to speed the
search and one to make it more extensive. The basic structure of the algorithm 
is to perform sequential searches until no improving move of this type can be 
found, followed by a computation of the best 4-opt move. If this does not improve 
the tour we halt. Otherwise we perform it and go back to the sequential search 
phase. Full details are postponed to the final paper. 

We have studied four basic variants on Kanellakis-Papadimitriou, with names 
constructed as follows. We begin with KP. This is followed by a “4” if the 4-opt 
search is included, and an F if the more extensive sequential search procedure 
is used. For this preliminary report we concentrate on KP4 applied to Nearest 
Neighbor starting tours, which provides perhaps the best tradeoff between speed, 
tour quality, and robustness. 

2.7 Iterated Local Search 

Each of the algorithms in the previous two sections can be used as the engine
for a “chained” or “iterated” local search procedure as proposed by [...]
to obtain significantly better tours. (Other ways of improving on basic local
search algorithms, such as the dynamic programming approach of [SB96], do not
seem to be as effective, although hybrids of this approach with iteration might
be worth further study.) In an “iterated” procedure, one starts by running the
basic algorithm once to obtain an initial champion tour. Then one repeats the
following process some predetermined number of times:

Apply a random 4-opt move to the current champion, and use the re- 
sulting tour as the starting tour during another run of the algorithm. If 
the resulting tour is better than the current champion, declare it to be 
the new champion. 

Typically don’t-look bits persist from one iteration to the next, which means 
that only 8 vertices are initially available as starting points for searches, which 
offers significant speedups. 
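
The iteration scheme can be sketched as follows, assuming the random 4-opt move is the segment-order-preserving "double-bridge" and using a deliberately simple stand-in local search (`swap_descent`) in place of 3-Opt or Kanellakis-Papadimitriou. Don't-look bits are not modeled, and all names are illustrative.

```python
import random


def tour_length(dist, tour):
    n = len(tour)
    return sum(dist[tour[p]][tour[(p + 1) % n]] for p in range(n))


def double_bridge(tour, rng):
    """Cut the tour into segments A, B, C, D and reconnect them as A, D, C, B.

    All four inter-segment edges change and no segment is reversed.
    Requires at least four cities.
    """
    i, j, k = sorted(rng.sample(range(1, len(tour)), 3))
    return tour[:i] + tour[k:] + tour[j:k] + tour[i:j]


def swap_descent(dist, tour):
    """Stand-in local search (not 3-Opt or KP): swap two cities while it helps."""
    tour = list(tour)
    improved = True
    while improved:
        improved = False
        for p in range(len(tour)):
            for q in range(p + 1, len(tour)):
                trial = list(tour)
                trial[p], trial[q] = trial[q], trial[p]
                if tour_length(dist, trial) < tour_length(dist, tour):
                    tour, improved = trial, True
    return tour


def iterated(local_search, dist, start, iterations, rng):
    """Kick the champion with a random 4-opt move, re-optimize, keep the better."""
    champion = local_search(dist, start)
    for _ in range(iterations):
        challenger = local_search(dist, double_bridge(champion, rng))
        if tour_length(dist, challenger) < tour_length(dist, champion):
            champion = challenger
    return champion
```

The four changed edges have eight endpoints in total, which is why only 8 vertices need to be reactivated when don't-look bits are carried over between iterations.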

In our implementations we choose uniformly from all 4-opt moves. Better
performance may be possible if one biases the choice toward “better” 4-opt
moves, as is done in the implementation of chained Lin-Kernighan for the STSP
by Applegate et al. We leave the study of such potential improvements
to future researchers, who can use our results as a guide.

We denote the iterated version of an algorithm A by iA, and in this report will
concentrate on $N$-iteration iKP4F and $10N$-iteration i3opt. The former is the
variant that tends to find the best tours while still running (usually) in feasible
amounts of time and the latter is typically the fastest of the variants and was
used in the code optimization application described in [YJKS97]. Searches for
the best 4-opt move in iKP4F occur only on the first and last iterations, so as to
avoid spending $\Omega(N^2)$ on each of the intermediate iterations.

2.8 STSP Algorithms 

For the three instance classes we consider that are actually symmetric, we also
include results for the Johnson-McGeoch implementations [JBMR, JM97] of
Lin-Kernighan (LK) and $N$-iteration iterated Lin-Kernighan (iLK). These were
applied to the symmetric representation of the instance, either as an upper-triangular
distance matrix or, where the instances were geometric, as a list of city
coordinates.



2.9 Lower Bound Algorithms 

We have already described in the introduction how Held-Karp lower bounds HK 
and optimal solutions OPT were calculated using Concorde. Here we only wish 
to point out that while the times reported for HK are accurate, including both 
the time to compute the NP-completeness transformation and to run Concorde 
on the result, the times reported for OPT are a bit flakier, as they only include 
the time to run Concorde on the already-constructed symmetric instance with 
an initial upper bound taken from the better of ZHANG1 and iKP4nn. Thus a
conservative measure of the running time for OPT would require that we increase 
the time reported in the table by both the time for the better of these two 
heuristics and the time for HK. In most cases this is not a substantial increase. 

All running times reported for OPT were measured in this way. For some 
of the larger size symmetric instances, we present tour length results without 
corresponding running times, as in these cases we cheated and found optimal 
tour lengths by applying Concorde directly to the symmetric representation 
with an upper bound supplied by iLK. 



3 Instance Generators 

In this section we describe and introduce shorthand names for our 12 instance 
generators, all of which were written in C, as well as our set of “real-world” 
instances.

3.1 Random Asymmetric Matrices (amat)

Our random asymmetric distance matrix generator chooses each distance $d(c_i, c_j)$
as an independent random integer $x$, $0 \le x \le 10^6$. Here and in what follows,
“random” actually means pseudorandom, using an implementation of the shift
register random number generator of [Knu81].

For these instances it is known that both the optimal tour length and the
AP lower bound approach a constant (the same constant) as $N \to \infty$. The
rate of approach appears to be faster if the upper bound U on the distance
range is smaller, or if the upper bound is set to the number of cities N, a
common assumption in papers about optimization algorithms for the ATSP (e.g.,
see [...]). Surprisingly large instances of this type can be solved
to optimality, with [MP96] reporting the solution of a 500,000-city instance.
(Interestingly, the same code was unable to solve a 35-city instance from a
real-world manufacturing application.)

Needless to say, there are no known practical applications of asymmetric
random distance matrices or of variants such as ones in which $d(c_i, c_j)$ is chosen
uniformly from the interval $[0, i + j]$, another popular class for optimizers. We
include this class to provide a measure of comparability with past results, and
also because it provides one of the stronger challenges to local search heuristics.

3.2 Random Asymmetric Matrices 
Closed under Shortest Paths (tmat) 

One of the reasons the previous class is uninteresting is the total lack of
correlation between distances. Note that instances of this type are unlikely to obey the
triangle inequality and so algorithms designed to exploit the triangle inequality,
such as the Repeated Assignment algorithm of [FGM82], will perform horribly
on them. A first step toward obtaining a more reasonable instance class is thus
to take a distance matrix generated by the previous generator and close it under
shortest path computation. That is, if $d(c_i, c_j) > d(c_i, c_k) + d(c_k, c_j)$ then set
$d(c_i, c_j) = d(c_i, c_k) + d(c_k, c_j)$ and repeat until no more changes can be made.
This is also a commonly-studied class.
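
The closure step amounts to an all-pairs shortest path computation. A minimal sketch using the Floyd-Warshall recurrence is given below; the zero diagonal, the use of Python's `random` module instead of a shift register generator, and the function names are assumptions of this sketch, not details of the generators described here.

```python
import random


def random_asymmetric_matrix(n, rng, upper=10**6):
    """Independent random integer distances in [0, upper], zero diagonal assumed."""
    return [[0 if i == j else rng.randint(0, upper) for j in range(n)]
            for i in range(n)]


def close_under_shortest_paths(dist):
    """Replace every distance by the shortest path length between its endpoints."""
    n = len(dist)
    d = [row[:] for row in dist]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if i != j and d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    return d
```

After the closure no triple of cities violates the triangle inequality, which is exactly the fixed point of the repeated replacement rule stated above.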

3.3 Random Symmetric Matrices (smat) 

For this class, $d(c_i, c_j)$ is an independent random integer $x$, $0 \le x \le 10^6$, for each
pair $1 \le i < j \le N$, and $d(c_i, c_j)$ is set to $d(c_j, c_i)$ when $i > j$. Again, there is no
plausible application, but these are also commonly studied and at least provide 
a ground for comparison to STSP algorithms. 

3.4 Random Symmetric Matrices 
Closed under Shortest Paths (tsmat) 

This class consists of the instances of the previous class closed under shortest 
paths so that the triangle inequality holds, another commonly studied class for 
ATSP codes. 
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3.5 Random Two-Dimensional Rectilinear Instances (rect) 

This is our final class of symmetric instances that have traditionally been used to 
test ATSP codes. It is a well-studied case of the STSP, providing useful insights 
in that domain. The cities correspond to random points uniformly distributed 
in a $10^6$ by $10^6$ square, and the distance is computed according to the rectilinear 
metric. We use the rectilinear rather than the more commonly-used Euclidean 
metric as this brings the instances closer to plausible ATSP applications, as 
we shall see below. In the STSP, experimental results for the Euclidean and 
rectilinear metrics are roughly the same [JBMR]. 



3.6 Tilted Drilling Machine Instances with Additive Norm (rtilt) 

These instances correspond to the following potential ATSP application. One 
wishes to drill a collection of holes on a tilted surface, and the drill is moved 
using two motors. The first moves the drill to its new x-coordinate, after which 
the second moves it to its new y-coordinate. Because the surface is tilted, the 
second motor can move faster when the y-coordinate is decreasing than when it 
is increasing. Our generator places the holes uniformly in the $10^6$ by $10^6$ square, 
and has three parameters: $u_x$ is the multiplier on $|\Delta x|$ that tells how much time 
the first motor takes, $u_y^+$ is the multiplier on $|\Delta y|$ when the direction is up, and 
$u_y^-$ is the multiplier on $|\Delta y|$ when the direction is down. 

Note that the previous class can be generated in this way using $u_x = u_y^+ = 
u_y^- = 1$. For the current class, we take $u_x = 1$, $u_y^+ = 2$, and $u_y^- = 0$. Assuming 
instantaneous movement in the downward direction may not be realistic, but it 
does provide a challenge to some of our heuristics, and has the interesting side 
effect that for such instances the AP- and HK-bounds as well as the optimal tour 
length are all exactly the same as if we had taken the symmetric instance with 
$u_x = u_y^+ = u_y^- = 1$. This is because in a cycle the sum of the upward movements 
is precisely balanced by the sum of the downward ones. 
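
The balancing argument can be checked numerically. The sketch below is illustrative only: the function and parameter names are hypothetical, with `ux`, `uy_up`, and `uy_down` standing for the horizontal, upward, and downward multipliers, and the default values taken from the parameter choice stated above.

```python
import random


def tilt_additive(p, q, ux=1, uy_up=2, uy_down=0):
    """Time to move the drill from hole p to hole q under the additive
    norm: horizontal time plus direction-dependent vertical time."""
    dx = abs(q[0] - p[0])
    dy = q[1] - p[1]
    vertical = uy_up * dy if dy > 0 else uy_down * (-dy)
    return ux * dx + vertical


def tour_length(points, dist):
    """Length of the closed tour that visits points in the given order."""
    n = len(points)
    return sum(dist(points[i], points[(i + 1) % n]) for i in range(n))


def random_holes(n, side=10**6, seed=0):
    """Holes placed uniformly at random on integer coordinates."""
    rng = random.Random(seed)
    return [(rng.randint(0, side), rng.randint(0, side)) for _ in range(n)]
```

For any closed tour, `tour_length(points, tilt_additive)` agrees with the length of the same tour when all three multipliers are 1, because the total upward travel in a cycle equals the total downward travel. Individual distances nevertheless remain asymmetric.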

3.7 Tilted Drilling Machine Instances with Sup Norm (stilt) 

For many drilling machines, the proper metric is the maximum of the times to 
move in the x and y directions rather than the sum. For this generator, holes 
are placed as before and we have the same three parameters, although now the 
distance is the maximum of $u_x|\Delta x|$ and $u_y^-|\Delta y|$ (downward motion) or $u_y^+|\Delta y|$ 
(upward motion). We choose the parameters so that downward motion is twice 
as fast as horizontal motion and upward motion is half as fast. That is, we set 
$u_x = 2$, $u_y^+ = 4$, and $u_y^- = 1$. 
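
A minimal sketch of the sup-norm distance, under the same naming assumptions as before (`ux`, `uy_up`, and `uy_down` are hypothetical names for the horizontal, upward, and downward multipliers, defaulting to the values 2, 4, and 1 chosen above):

```python
def tilt_sup(p, q, ux=2, uy_up=4, uy_down=1):
    """Time to move from hole p to hole q when both motors run at once:
    the maximum of the horizontal time and the vertical time."""
    dx = abs(q[0] - p[0])
    dy = q[1] - p[1]
    vertical = uy_up * dy if dy > 0 else uy_down * (-dy)
    return max(ux * dx, vertical)
```

Unlike the additive case, the asymmetry here does not cancel around a cycle, since the vertical term only matters on moves where it dominates the horizontal one.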

3.8 Random Euclidean Stacker Crane Instances (crane) 

In the Stacker Crane Problem one is given a collection of source-destination pairs 
$s_i, d_i$ in a metric space where for each pair the crane must pick up an object at 
location $s_i$ and deliver it to location $d_i$. The goal is to order these tasks so as 
to minimize the time spent by the crane going between tasks, i.e., moving from 
the destination of one pair to the source of the next one. This can be viewed as 
an ATSP in which city $c_i$ corresponds to the pair $s_i, d_i$ and the distance from 
$c_i$ to $c_j$ is the metric distance between $d_i$ and $s_j$. (Technically, the goal may 
be to find a minimum cost Hamiltonian path rather than a cycle, but that is a 
minor issue.) Our generator has a single parameter $u \ge 1$, and constructs its 
source-destination pairs as follows. 

First, we pick the source $s$ uniformly from the $10^6$ by $10^6$ square. Then 
we pick two integers $x$ and $y$ uniformly and independently from the interval 
$[-10^6/u, 10^6/u]$. The destination is then the vector sum $s + (x, y)$. To preserve 
a sense of geometric locality, we want the typical destination to be closer to its 
source than are all but a bounded number of other sources. Thus, noting that 
for an $N$-city instance of this sort the expected number of sources in a $10^6/\sqrt{N}$ 
by $10^6/\sqrt{N}$ square is 1, we generated our instances using $u$ approximately equal 
to $\sqrt{N}$. In particular we took u = 10, 18, 32, 56 for N = 100, 316, 1000, 3162. 
Preliminary experiments suggested that if we had simply fixed u at 10, the 
instances would have behaved more and more like random asymmetric ones as N 
got larger. 

Note that these instances do not necessarily obey the triangle inequality 
(since the time for traveling from source to destination is not counted) , although 
there are probably few large violations. 
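
The construction above can be sketched as follows. The function names are hypothetical, and two details are assumptions not settled by the text: destinations are allowed to fall outside the square, and distances are left as unrounded Euclidean lengths.

```python
import math
import random


def suggested_u(n):
    """Locality parameter chosen as the nearest integer to sqrt(N)."""
    return round(math.sqrt(n))


def crane_instance(n, u, side=10**6, seed=0):
    """Generate n (source, destination) pairs: the source is uniform in
    the square, the destination is the source plus a random offset whose
    coordinates are integers in [-side/u, side/u]."""
    rng = random.Random(seed)
    r = side // u
    pairs = []
    for _ in range(n):
        s = (rng.randint(0, side), rng.randint(0, side))
        d = (s[0] + rng.randint(-r, r), s[1] + rng.randint(-r, r))
        pairs.append((s, d))
    return pairs


def crane_distance(pairs, i, j):
    """ATSP distance from city i to city j: the Euclidean distance from
    the destination of pair i to the source of pair j."""
    d_i = pairs[i][1]
    s_j = pairs[j][0]
    return math.dist(d_i, s_j)
```

With this choice, `suggested_u` reproduces the values 10, 18, 32, and 56 quoted for the four instance sizes.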

3.9 Disk Drive Instances (disk) 

These instances attempt to capture some of the structure of the problem of 
scheduling the read head on a computer disk, although we ignore some 
technicalities, such as the fact that tracks get shorter as one gets closer to the 
center of the disk. This problem is similar to the stacker crane problem in that 
the files to be read have a start position and an end position in their tracks. 
Sources are generated as before, but now the destination has the same y-coordinate 
as the source. To determine the destination’s x-coordinate, we generate a random 
integer $x \in [0, 10^6/u]$ and add it to the x-coordinate of the source, but do so modulo 
$10^6$, thus capturing the fact that tracks can wrap around the disk. The distance 
from a destination to the next source is computed based on the assumption that 
the disk is spinning in the x-direction at a given rate and that the time for moving 
in the y-direction is proportional to the distance traveled (a slightly unrealistic 
assumption given the need for acceleration and deceleration) at a significantly 
slower rate. To get to the next source we first move to the required y-coordinate 
and then wait for the spinning disk to deliver the x-coordinate to us. For our 
instances in this class, we set u = 10 and assumed that y-direction motion was 
10 times slower than x-direction motion. Full details are postponed to the final 
paper. Note that this is another class where the triangle inequality need not be 
strictly obeyed. 
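
Since the full details are deferred, the following is only one plausible reading of the description above, not the authors' distance function. It assumes that the disk advances one x-unit per time unit, that the head needs `y_slowdown` time units per y-unit, and that the spin direction is such that the x-coordinate under the head increases with time; the function name and all parameter names are hypothetical.

```python
def disk_distance(dest, src, width=10**6, y_slowdown=10):
    """Time from finishing a file at dest to starting the next at src:
    first seek to the new track, then wait for the spinning disk to
    bring the required x-coordinate under the head."""
    xd, yd = dest
    xs, ys = src
    seek = y_slowdown * abs(ys - yd)
    # During the seek the disk keeps spinning, so the wait is measured
    # from the position reached at the end of the seek, modulo the track.
    wait = (xs - xd - seek) % width
    return seek + wait
```

Under these assumptions the distance is strongly asymmetric (a source just behind the head costs almost a full revolution), which is consistent with the remark that the triangle inequality need not hold.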

3.10 Pay Phone Coin Collection Instances (coins) 

These instances model the problem of collecting money from pay phones in a 
grid-like city. We assume that the city is a $k$ by $k$ grid of city blocks. The pay 
phones are uniformly distributed over the boundaries of the blocks. Although 
all the streets (except the loop surrounding the city) are two-way, the money 
collector can only collect money from pay phones on the side of the street she 
is currently driving on, and is not allowed to make “U-turns” either between or 
at corners. This problem becomes trivial if there are so many pay phones that 
most blocks have one on all four of their sides. Our class is thus generated by 
letting $k$ grow with $N$, in particular as the nearest integer to $10\sqrt{N}$. 



3.11 No-Wait Flowshop Instances (shop50) 

The no-wait flowshop was the application that inspired the local search algorithm 
of Kanellakis and Papadimitriou. In a $k$-processor no-wait flowshop, a job $u$ 
consists of a sequence of tasks $(u_1, u_2, \ldots, u_k)$ that must be worked on by a fixed 
sequence of machines, with processing of $u_{i+1}$ starting on machine $i + 1$ as soon 
as processing of $u_i$ is complete on machine $i$. This models situations where for 
example we are processing heated materials that must not be allowed to cool 
down, or where there is no storage space to hold waiting jobs. 

In our generator, the task lengths are independently chosen random integers 
between 0 and 1000, and the distance from job $v$ to job $u$ is the minimum 
possible amount by which the finish time for $u_k$ can exceed that for $v_k$ if $u$ is 
the next job to be started after $v$. A version of this class with $k = 5$ processors 
was studied in [Zha93,Zha00], but for randomly generated instances with this 
few processors the AP bound essentially equals the optimal tour length, and 
even PATCH averaged less than 0.1% above optimal. To create a bit more of an 
algorithmic challenge in this study, we increased the number of processors to 50. 
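
The distance just defined can be computed in time linear in the number of machines. The sketch below is an illustration under the stated no-wait rule rather than the code used in the study, and the function names are hypothetical. If job v starts at time 0, job u may start no earlier than the time at which, for every machine, u would arrive there only after v has left it.

```python
import random
from itertools import accumulate


def nowait_distance(v, u):
    """Minimum amount by which the finish time of job u exceeds that of
    job v when u is the next job started after v (no-wait flowshop)."""
    if len(v) != len(u):
        raise ValueError("jobs must have one task per machine")
    # finish_v[i]: time at which v leaves machine i, with v started at 0.
    finish_v = list(accumulate(v))
    # arrive_u[i]: delay between u's start and its arrival at machine i.
    arrive_u = [0] + list(accumulate(u))
    start_u = max(finish_v[i] - arrive_u[i] for i in range(len(v)))
    return start_u + arrive_u[-1] - finish_v[-1]


def random_job(k=50, max_len=1000, seed=0):
    """Task lengths drawn independently from the integers 0..max_len."""
    rng = random.Random(seed)
    return [rng.randint(0, max_len) for _ in range(k)]
```

For example, with two machines, `nowait_distance([5, 1], [1, 5])` is 5 while `nowait_distance([1, 5], [5, 1])` is 1, which shows how the asymmetry arises.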



3.12 Approximate Shortest Common Superstring Instances (super) 

A very different application of the ATSP is to the shortest common superstring 
problem, where the distance from string A to string B is the length of B minus 
the length of the longest prefix of B that is also a suffix of A. Unfortunately, 
although this special case of the ATSP is NP-hard, real-world instances tend 
to be easy [Gia] and we were unable to devise a generator that produced 
nontrivial instances. We thus have modeled what appears to be a harder problem, 
approximate shortest common superstring. By this we mean that we allow the 
corresponding prefixes and suffixes to only approximately match, but penalize 
the mismatches. In particular, the distance from string A to string B is the 
length of B minus $\max\{j - 2k\}$, taken over all $j$ and $k$ such that there is a prefix 
of B of length $j$ that matches a suffix of A in all but $k$ positions. Our generator 
uses this metric applied to random binary strings of length 20. 
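
A direct, quadratic-time sketch of this distance is given below. It reads the penalty as subtracting two for each mismatched position, so that an overlap of length j with k mismatches scores j minus 2k, and it assumes that the empty overlap (score 0) is always allowed; the function names are hypothetical.

```python
import random


def approx_overlap_distance(a, b):
    """Length of b minus the best score j - 2k, where the length-j prefix
    of b matches the length-j suffix of a in all but k positions."""
    best = 0  # the empty overlap: j = 0, k = 0
    for j in range(1, min(len(a), len(b)) + 1):
        suffix = a[len(a) - j:]
        mismatches = sum(x != y for x, y in zip(b[:j], suffix))
        best = max(best, j - 2 * mismatches)
    return len(b) - best


def random_binary_string(length=20, seed=0):
    """A random binary string, as used by the generator described above."""
    rng = random.Random(seed)
    return "".join(rng.choice("01") for _ in range(length))
```

For instance, the distance from "0011" to "1100" is 2, because the exact overlap "11" of length 2 scores better than any longer overlap with mismatches.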

3.13 Specific Instances: TSPLIB and Other Sources (realworld) 

In addition to our randomly-generated instance classes, and as a sanity check 
for our results on those classes, we have also tested a variety of specific 
“real-world” instances from TSPLIB and other sources. Our collection includes the 
27 ATSP instances currently in TSPLIB plus 20 new instances from additional 
applications. The full list is given in Table 14, but here is a summary of the 
sources and applications involved. 

The TSPLIB instances are as follows: The four rbg instances come from a 
stacker crane application. The two ft instances arose in a problem of optimally 
sequencing tasks in the coloring plant of a resin production department, as 
described in . The 17 ftv instances, described in [FTV94,FT97], come from 
a pharmaceutical delivery problem in downtown Bologna, with instances ftv90 
through ftv160 derived from ftv170 by deleting vertices as described in [FT97]. 
Instances ry48p and kro124p are symmetric Euclidean instances rendered 
asymmetric by slight random perturbations of their distance matrices as described 
in . Instance p43 comes from a scheduling problem arising in chemical 
engineering. Instance br17 is from an unknown source. 

Our additional instances come from five sources. big702 models an actual 
(albeit outdated) coin collection problem and was provided to us by Bill Cook. 
The three td instances came from a detailed generator constructed by Bruce 
Hillyer of Lucent for the problem of scheduling reads on a specific tape drive, 
based on actual timing measurements. The nine dc instances come from a table 
compression application and were supplied by Adam Buchsbaum of AT&T Labs. 
The two code instances came from a code optimization problem described in 
[YJKS97]. The five atex instances come from a robotic motion planning problem 
and were provided to us by Victor Manuel of the Carlos III University of Madrid. 

4 Results and Conclusions 

Our experiments were performed on what is today a relatively slow machine: a 
Silicon Graphics Power Challenge machine with 196 MHz MIPS R10000 
processors and 1 Megabyte 2nd level caches. This machine has 7.6 Gigabytes of main 
memory, shared by 31 of the above processors. Our algorithms are sequential, so 
the parallelism of the machine was exploited only for performing many 
individual experiments at the same time. For the randomized algorithms NN, RA+, KP4, 
i3opt, and iKP4F, the results we report are averages over 5 or more runs for each 
instance (full details in the final report). 

Tables 2 through 13 present average excesses over the HK bound and running 
times in user seconds for the algorithms we highlighted in Section 2 and the 
testbeds we generated using each of our 12 random instance generators. In each 
table, algorithms are ordered by their average tour quality for instances of size 
1000. Our first conclusion is that for each of the classes, at least one of our 
algorithms can find a fairly good solution quickly. See Table 1, which for each of 
the classes lists the algorithm that gets the best results on 1000-city instances 
in less than 80 user seconds. Note that in all cases, at least one algorithm is able 
to find tours within 9% of the HK bound within this time bound, and in all but 
three cases we can get within 2.7%. There is, however, a significant variety in the 
identity of the winning algorithm, with ZHANG1 the winner on five of the classes, 
iKP4F on four, i3opt on two, and KP4 on one. If time is no object, iKP4F wins 
out over the other local search approaches in all 12 classes. However, its running 
time grows to over 19,000 seconds in the case of the 1000-city rtilt and shop50 
instances, and in the latter case it is still bested by ZHANG1. 



Table 1. For the 1000-city instances of each class, the algorithm producing the best 
average tour among those that finish in less than 80 seconds, the average percent by 
which that tour exceeds the HK bound, and the average running times. In addition 
we list the average percentage shortfall of the AP bound from the HK bound, and the 
average symmetry and triangle inequality metrics as defined in the text below. 

Class   Winner    % Excess   Seconds   % HK-AP  Symmetry  Triangle
amat    ZHANG1         .04      19.8       .05     .9355      .163
tmat    ZHANG1         .00       6.4       .03     .9794     1.000
smat    iKP4F         5.26      67.0     19.70    1.0000      .163
tsmat   i3opt         2.42      40.7     17.50    1.0000     1.000
rect    iKP4F         2.20      43.5     17.16    1.0000     1.000
rtilt   KP4           8.33      38.7     17.17     .8760     1.000
stilt   i3opt         4.32      75.3     14.60     .9175     1.000
crane   iKP4F         1.27      60.8      5.21     .9998      .934
disk    ZHANG1         .02      16.6       .34     .9990      .646
coins   iKP4F         2.66      43.7     14.00     .9999     1.000
shop50  ZHANG1         .03      50.7       .15     .8549     1.000
super   ZHANG1         .21      52.9      1.17     .9782     1.000



The table also includes three instance measurements computed in hopes of 
finding a parameter that correlates with algorithmic performance. The first is 
the percentage by which the AP bound falls short of the HK bound. The second 
is a measure of the asymmetry of the instance. For this we construct the 
“symmetrized” version $I'$ of an instance $I$ with distances $d'(c_i, c_j) = d'(c_j, c_i)$ set to 
be the average of $d(c_i, c_j)$ and $d(c_j, c_i)$. Our symmetry metric is the ratio of the 
standard deviation of $d'(c_i, c_j)$, $i \ne j$, to the standard deviation of $d(c_i, c_j)$, 
$i \ne j$. A value of 1 implies the original graph was symmetric. Our third 
measure attempts to quantify the extent by which the instance violates the triangle 
inequality. For each pair $c_i, c_j$ of distinct cities we let $d'(c_i, c_j)$ be the minimum 
of $d(c_i, c_j)$ and $\min\{d(c_i, c_k) + d(c_k, c_j) : 1 \le k \le N\}$. Our measure is then the 
average, over all pairs $c_i, c_j$, of $d'(c_i, c_j)/d(c_i, c_j)$. A value of 1 implies that the 
instance obeys the triangle inequality. 
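
The two structural metrics can be sketched directly from their definitions. The code below is an illustration, not the authors' implementation: the function names are hypothetical, the population standard deviation is used (the ratio is the same with the sample version), and a pair at distance zero is counted as contributing a ratio of 1, which is an assumption the text does not address.

```python
from statistics import pstdev


def symmetry_metric(d):
    """Ratio of the standard deviation of the symmetrized off-diagonal
    distances to that of the original off-diagonal distances.
    Assumes the original distances are not all equal."""
    n = len(d)
    pairs = [(i, j) for i in range(n) for j in range(n) if i != j]
    original = [d[i][j] for i, j in pairs]
    symmetrized = [(d[i][j] + d[j][i]) / 2 for i, j in pairs]
    return pstdev(symmetrized) / pstdev(original)


def triangle_metric(d):
    """Average over ordered pairs of distinct cities of the ratio between
    the best one-stop shortcut (or the direct distance, if shorter) and
    the direct distance."""
    n = len(d)
    ratios = []
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            shortcut = min(d[i][k] + d[k][j] for k in range(n))
            best = min(d[i][j], shortcut)
            ratios.append(best / d[i][j] if d[i][j] else 1.0)
    return sum(ratios) / len(ratios)
```

Both functions return 1 on a symmetric matrix that obeys the triangle inequality, and values below 1 otherwise.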

Based on the table, there is a clear correlation between the presence of a 
small HK-AP gap and the superiority of ZHANG1, but no apparent correlations 
with the other two instance metrics. Note that when the HK-AP gap is close 
to zero and ZHANG1 is the winner, it does extremely well, never being worse 
than .21% above the HK bound. A plausible explanation is that when the AP 
bound is close to the HK bound it is also close to optimal, which means that 
an optimal tour is not much longer than (and hence perhaps not very different from) 
a minimum cycle cover. Algorithms such as ZHANG1 (and RA+ and PATCH) that 
start by computing a minimum cycle cover are thus likely to perform well, and 
ZHANG1, by doing a partial branch-and-bound search for a tour, is most likely to 
do best. Conversely, when the HK-AP gap is large, ZHANG1 is at a disadvantage, 
for example producing tours that are more than 11% above the HK bound for 
classes rect, rtilt, stilt, and coins. (See Tables 6, 7, 8, and 11.) 

Here are some more tentative conclusions based on the results reported in 
Tables 2 through 13 and additional experiments we have performed. 

1. Simple ATSP tour construction heuristics such as Nearest Neighbor (NN) and 
Greedy (GR) can perform abysmally, with NN producing tours for some classes 
that are over 300% above the HK-bound. (GR can be even worse, although 
PATCH is only really bad for symmetric random distance matrices.) 

2. Reasonably good performance can be consistently obtained from algorithms 
in both local search and cycle-patching classes: KP4nn averages less than 
15.5% above HK for each instance size of all twelve of our instance classes 
and ZHANG1 averages less than 13% above HK for each. Running times for 
both are manageable when $N \le 1000$, with the averages for all the 1000-city 
instance classes being 8 minutes or less for both algorithms. 

3. Running time (for all algorithms) can vary widely depending on instance 
class. For 3162 cities, the average time for ZHANGl ranges from 71 to roughly 
22,500 seconds. For KP4nn the range is a bit less extreme: from 43 to 2358. 

4. For the instance classes yielding the longer running times, the growth rate 
tends to be substantially more explosive for the Zhang variants than for the 
local search variants, suggesting that the latter will tend to be more usable as 
instance size increases significantly beyond 3162 cities. As an illustration, see 
Figures 1 and 2, which chart the change in solution quality and running time 
as N increases for four of our instance classes. Running times are normalized 
by dividing through by $N^2$, the actual size of the instance and hence a lower 
bound on the asymptotic growth rate. Although these charts use the same 
sort of dotted line to represent both local search algorithms KP4 and iKP4F, 
this should be easy to disambiguate since the former always produces worse 
tours in less time. Similarly, the same sort of solid line is used to represent 
both AP-based algorithms PATCH and ZHANG1, with the former always 
producing worse tours more quickly. By using just two types of lines, we can 
more clearly distinguish between the two general types of algorithm. Figure 
1 covers classes amat and super, both with small HK-AP gaps, although 
for the latter the gap may be growing slightly with N as opposed to 
declining toward 0 as it appears to do for classes amat, tmat, disk, and shop50. 
The tour quality chart for amat is typical of these four, with both PATCH and 
ZHANG1 getting better as N increases, and both KP4 and iKP4F getting worse. 
The difference for class super is that PATCH now gets worse as N increases, 
and ZHANG1 does not improve. As to running times (shown in the lower two 
charts of the figure), KP4 seems to be running in roughly quadratic time (its 
normalized curve is flat), whereas iKP4F is not only slower but seems to have 
a slightly higher running time growth rate. ZHANG1 has the fastest growth 
rate, substantially worse than quadratic, and on the super class has already 
crossed over with iKP4F by 1000 cities. Figure 2 covers two classes where 
the HK-AP gap is substantial, and we can see marked differences from the 
previous classes. 

5. As might be expected, the Repeated Assignment Algorithm (RA+) performs 
extremely poorly when the triangle inequality is substantially violated. It 
does relatively poorly for instances with large HK-AP gaps. And it always 
loses to the straightforward PATCH algorithm, even though the latter provides 
no theoretical guarantee of quality. 

6. Although the Kanellakis-Papadimitriou algorithm was motivated by the 
no-wait flowshop application, both it and its iterated variants are outperformed 
on these by all the AP-based algorithms, including RA+. 

7. For all instance classes, optimal solutions are relatively easy to obtain for 
100-city instances, the maximum average time being less than 30 minutes 
per instance (less than 6 minutes for all but one class). For 7 of the 12 classes 
we were able to compute optimal solutions for all our test instances with 
1000 cities or fewer, never using more than 5 hours for any one instance. For 
all classes the time for optimization was an order of magnitude worse than 
the time for ZHANG1, but for three classes (tmat, disk, and shop50) it was 
actually faster than that for iKP4F. 

8. The currently available ATSP heuristics are still not as powerful in the ATSP 
context as are the best STSP heuristics in the STSP context. The above times 
are far in excess of those needed for similar performance levels on standard 
instance classes for the STSP. Moreover, for symmetric instances, our ATSP 
codes are easily bested by ones specialized to the STSP (both in the case of 
approximation and of optimization). 

9. It is difficult to reproduce sophisticated algorithm implementations exactly, 
even if one is only interested in solution quality, not running time. Although 
our two implementations of ZHANGl differ only in their tie-breaking rules and 
one rarely-invoked halting rule, the quality of the tours they produce can 
differ significantly on individual instances, and for some instance classes one 
or the other appears to dominate its counterpart consistently. Fortunately, 
we can in essence be said to have “reproduced” the original results in that 
the two implementations do produce roughly the same quality of tours overall. 

10. The task of devising random instance generators that produce nontrivial in- 
stances (ones for which none of the heuristics consistently finds the optimal 
solution) is a challenge, even when one builds in structure from applications. 
One reason is that without carefully constraining the randomization, it can 
lead you to instances where the AP bound quickly approaches the optimal 
tour length. A second reason, as confirmed by some of our real-world in- 
stances, is that many applications do give rise to such easy instances. 
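
The normalization used for the running-time charts (dividing by the square of the number of cities) can be illustrated with the class amat running times reported in Table 2. The helper names below are hypothetical; the data are the KP4 and ZHANG1 times in user seconds for 100, 316, 1000, and 3162 cities.

```python
# Running times in user seconds for class amat, copied from Table 2.
SIZES = [100, 316, 1000, 3162]
TIMES = {
    "KP4": [0.06, 0.65, 6.0, 63],
    "ZHANG1": [0.06, 0.84, 19.8, 694],
}


def normalized_times(times, sizes=SIZES):
    """Running time divided by N^2, the number of distance-matrix entries."""
    return [t / (n * n) for t, n in zip(times, sizes)]


def growth_ratio(times, sizes=SIZES):
    """Normalized time at the largest size relative to the smallest size;
    a value near 1 indicates roughly quadratic growth."""
    norm = normalized_times(times, sizes)
    return norm[-1] / norm[0]
```

On these figures the normalized KP4 time stays essentially flat from 100 to 3162 cities, while the normalized ZHANG1 time grows by more than a factor of ten, in line with the observations in item 4.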

Fig. 1. Tour quality and running time as a function of the number of cities for classes 
amat and super and algorithms KP4, iKP4F, PATCH, and ZHANG1. The same type of line 
is used for the local search algorithms KP4 and iKP4F and for the AP-based algorithms 
PATCH and ZHANG1, but the correspondence between line and algorithm should be clear: 
Algorithm iKP4F is always better and slower than KP4, and algorithm ZHANG1 is always 
better and slower than PATCH. 

Table 2. Results for class amat: Random asymmetric instances. 

                               Percent above HK                         Time in Seconds
Alg           100       316      1000      3162       100       316      1000      3162
NN         195.23    253.97    318.79    384.90       .05       .46       4.8        50
RA+         86.60    100.61    110.38    134.14       .10      1.25      26.3       621
i3opt        5.10     13.27     27.12     45.53       .75      4.67      36.9       273
KP4          5.82      6.95      8.99     11.51       .06       .65       6.0        63
PATCH       10.95      6.50      2.66      1.88       .04       .40       4.8        42
iKP4F         .56       .74      1.29      2.43       .51      5.75      76.7      1146
ZHANG1        .97       .16       .04       .04       .06       .84      19.8       694
OPT           .29       .08       .02         -     24.73    194.58    1013.6         -
HK            .00       .00       .00       .00      1.05      7.80      95.0      1650
MATCH       -0.65     -0.29     -0.04     -0.04       .04       .40       4.6        40



Table 3. Results for class tmat: amat instances closed under shortest paths. 

                               Percent above HK                         Time in Seconds
Alg           100       316      1000      3162       100       316      1000      3162
NN          38.20     37.10     37.55     36.66       .05       .44       4.4        46
KP4          1.41      2.23      3.09      4.03       .17      1.16      10.9       115
i3opt         .30       .85      1.63      2.25      1.53      7.08      40.1       231
RA+          4.88      3.10      1.55       .46       .09      1.13      20.2       653
iKP4F         .09       .14       .41       .65      8.87     56.17     489.4      4538
PATCH         .84       .64       .17       .00       .04       .39       4.7        68
ZHANG1        .06       .01       .00       .00       .05       .49       6.4        71
OPT           .03       .00       .00         -      6.16     20.29      91.9         -
HK            .00       .00       .00       .00      1.15      8.03     113.3      2271
MATCH       -0.34     -0.16     -0.03       .00       .04       .39       4.6        62



Table 4. Results for class smat: Random symmetric instances. 

                               Percent above HK                         Time in Seconds
Alg           100       316      1000      3162       100       316      1000      3162
RA+        693.90   2263.56   7217.91  24010.30       .10      1.21      10.0       703
PATCH      126.39    237.59    448.48    894.17       .04       .40       4.9        64
NN         139.28    181.18    233.10    307.11       .05       .46       4.8        50
i3opt        8.43     15.57     25.55     39.84       .75      4.73      36.2       285
KP4         10.59     11.50     13.13     15.31       .06       .64       5.8        62
ZHANG1       5.52      5.84      5.97      5.67       .12      4.01     200.4      9661
iKP4F        3.98      4.34      5.26      6.21       .52      5.44      67.0      1007
LK           1.72      2.22      3.43      4.68       .18      1.75      11.0        29
iLK           .15       .42      1.17      2.35       .45      5.22      31.4       258
OPT           .10       .04       .01         -     16.30    378.09    1866.5         -
HK            .00       .00       .00       .00      1.44     12.90     267.4      7174
MATCH      -19.83    -20.73    -19.66    -20.21       .04       .39       4.6        56

Table 5. Results for class tsmat: smat instances closed under shortest paths. 

                               Percent above HK                         Time in Seconds
Alg           100       316      1000      3162       100       316      1000      3162
RA+         32.75     40.97     47.35     49.44       .10      1.22      26.3       557
PATCH       14.46     19.37     23.68     26.42       .04       .36       3.7        45
NN          23.64     23.18     22.83     22.15       .04       .44       4.4        47
KP4          3.12      3.71      4.15      4.71       .23      1.43      10.9       122
ZHANG1       3.49      4.06      3.38      3.51       .17      6.82     373.2     16347
i3opt        1.12      1.88      2.42      3.18      2.05      8.45      40.7       244
LK            .62      1.17      1.50      2.09       .41      2.68      10.8        42
iKP4F         .91      1.19      1.44      1.98     20.69     88.26     479.1      5047
iLK           .12       .27       .44       .85      8.12     42.19     135.7       696
OPT           .10       .12       .11         -     28.00    549.04   12630.3         -
HK            .00       .00       .00       .00      1.83     15.15     217.8       885
MATCH      -17.97    -19.00    -17.50    -17.92       .04       .35       3.4        37



Table 6. Results for class rect: Random 2-dimensional rectilinear instances. Optima 
for 316- and 1000-city instances computed by symmetric code. 

                               Percent above HK                         Time in Seconds
Alg           100       316      1000      3162       100       316      1000      3162
RA+         64.02     73.93     78.90     85.72       .10      1.14      20.2       494
NN          26.39     27.51     26.11     26.55       .06       .47       4.9        51
PATCH       17.90     18.73     18.46     19.39       .04       .37       3.9        45
ZHANG1       9.75     12.40     12.23     12.19       .18      7.78     466.4     22322
KP4          5.11      5.06      5.00      5.17       .06       .55       5.2        53
i3opt        2.02      2.57      2.66      3.09       .66      3.95      29.6       292
iKP4F        1.75      2.16      2.20      2.32       .48      4.41      43.5       483
LK           1.32      1.55      1.77      1.92       .07       .26        .7         1
iLK           .69       .72       .79       .82       .42      2.47      17.8       109
OPT           .68       .67       .68         -     94.82         -         -         -
HK            .00       .00       .00       .00      1.93     14.51     205.0      1041
MATCH      -20.42    -17.75    -17.17    -16.84       .04       .36       3.6        37



Table 14 presents results for our realworld testbed. For space reasons we 
restrict ourselves to reporting the three instance parameters and the results for 
ZHANG1 and for iKP4F, our local search algorithm that produces the best tours. 
Here we report the excess over the optimal tour length rather than over the HK 
bound, as we were able to determine the optimal tour length for 46 of the 47 
instances. (In some cases this took less time than it took to run iKP4F, although 
dc895 took almost 20 hours.) Instances are grouped by class, within classes they 
are ordered by number of cities, and the classes are ordered roughly in order of 
increasing HK-AP gap. 

A first observation is that, true to form, ZHANG1 does very well when the 
HK-AP gap is close to zero, as is the case for the rbg stacker crane instances, the td 
tape drive instances, and the dc table compression instances. Surprisingly, it also 

Table 7. Results for class rtilt: Tilted drilling machine instances with additive norm. 
Optima for 316- and 1000-city instances computed by symmetric code applied to 
equivalent symmetric instances. 

                               Percent above HK                         Time in Seconds
Alg           100       316      1000      3162       100       316      1000      3162
RA+         61.95     73.47     78.27     82.03       .13      1.90      49.9      1424
NN          28.47     28.28     27.52     24.60       .06       .47       5.0        51
PATCH       17.03     18.91     18.38     19.39       .05       .50       7.4       127
i3opt        1.69      4.22     12.97     18.41      4.61     43.59     254.3       967
ZHANG1       9.82     12.20     11.81     11.45       .19      7.86     460.8     22511
KP4          6.06      7.35      8.33      9.68       .41      3.87      38.7       312
iKP4F        1.80      4.12      7.29      8.89     38.90   1183.94   19209.0    145329
OPT           .68       .67       .68         -    152.05         -         -         -
HK            .00       .00       .00       .00      2.06     15.98     226.6       875
MATCH      -20.42    -17.75    -17.17    -16.84       .05       .49       7.1       116



Table 8. Results for class stilt: Tilted drilling machine instances with sup norm. 

                               Percent above HK                         Time in Seconds
Alg           100       316      1000      3162       100       316      1000      3162
RA+         55.79     62.76     65.03     71.48       .10      1.10      22.1       566
NN          30.31     30.56     27.62     24.42       .05       .54       4.9       112
PATCH       23.33     22.79     23.18     24.41       .04       .44       5.6        69
ZHANG1      10.75     13.99     12.66     12.86       .17      7.39     423.7      9817
KP4          8.57      8.79      8.80      8.15       .14      1.01       7.7        73
i3opt        3.29      3.95      4.32      4.28      1.38      9.38      75.3       601
iKP4F        3.00      3.54      3.96      4.13      7.17    161.40    4082.4     72539
OPT          1.86         -         -         -   1647.46         -         -         -
HK            .00       .00       .00       .00      1.86     15.53     174.5       864
MATCH      -18.41    -14.98    -14.65    -14.04       .04       .42       5.2        62



Table 9. Results for class crane: Random Euclidean stacker crane instances. 

                               Percent above HK                         Time in Seconds
Alg           100       316      1000      3162       100       316      1000      3162
RA+         40.80     50.33     53.60     54.91       .10      1.20      18.1       590
NN          40.72     41.66     43.88     43.18       .05       .46       4.8        50
PATCH        9.40     10.18      9.45      8.24       .04       .38       4.0        53
KP4          4.58      4.45      4.78      4.26       .07       .57       5.3        52
ZHANG1       4.36      4.29      4.05      4.10       .10      3.52     172.6      7453
i3opt        1.98      2.27      1.95      2.12       .67      3.99      31.2       270
iKP4F        1.46      1.79      1.27      1.36       .66      6.33      60.8       696
OPT          1.21      1.30         -         -    216.54  11290.42         -         -
HK            .00       .00       .00       .00      1.43     12.80     925.2      1569
MATCH       -7.19     -6.34     -5.21     -4.43       .04       .38       4.0        44









Table 10. Results for class disk: Disk drive instances. 

                               Percent above HK                         Time in Seconds
Alg           100       316      1000      3162       100       316      1000      3162
NN          96.24    102.54    115.51    161.99       .06       .48       5.0        54
RA+         86.12     58.27     42.45     25.32       .11      1.66      48.9      1544
KP4          2.99      3.81      5.81      9.17       .07       .99      21.2       760
i3opt         .97      2.32      3.89      5.29       .73      5.66      62.4       687
iKP4F         .56       .48       .96      1.77       .64     22.62    1240.9     61655
PATCH        9.40      2.35       .88       .30       .04       .47       7.5       176
ZHANG1       1.51       .27       .02       .01       .08      1.00      16.6       247
OPT           .24       .06       .01         -     22.66     70.99     398.2         -
HK            .00       .00       .00       .00      1.27      8.71     139.1      4527
MATCH       -2.28     -0.71     -0.34     -0.11       .04       .46       7.2       168



Table 11. Results for class coins: Pay phone coin collection instances.

                Percent above HK                    Time in Seconds
Alg          100     316    1000    3162       100       316      1000      3162
RA+        52.74   64.95   68.78   71.20       .09      1.00      16.3       329
NN         26.08   26.71   26.80   25.60       .05       .42       4.4        46
PATCH      16.48   16.97   17.45   18.20       .04       .32       3.5        41
ZHANG1      8.20   11.03   11.14   11.42       .14      6.88     435.9     22547
KP4         5.74    6.59    6.15    6.34       .06       .49       4.7        48
i3OPT       2.98    3.37    3.48    3.83       .64      3.71      28.3       257
iKP4F       2.71    2.99    2.66    2.87       .52      4.23      43.7       542
OPT         1.05    1.49       -       -    356.69  67943.26         -         -
HK           .00     .00     .00     .00      1.66     13.75     141.9       924
MATCH     -15.04  -13.60  -13.96  -13.09       .03       .31       3.1        33



does well on many instances with large HK-AP gaps, even ones with gaps larger 
than 97%. Indeed, on only two instances is it worse than 3.5% above optimal, 
with its worst performance being roughly 11% above optimal for ft53. Note that 
ZHANG1 beats iKP4F more often than it loses, winning on 26 of the 47 instances 
and tying on an additional 6 (although many of its wins are by small amounts). 
If one is willing to spend up to 33 minutes on any given instance, however, then 
iKP4F is a somewhat more robust alternative. It stays within 4.12% of optimal 
for all 47 instances and within 1.03% for all but two of them. 

5 Future Work 

This is a preliminary report on an ongoing study. Some of the directions we are 
continuing to explore are the following. 

1. Running Time. In order to get a more robust view of running time growth 
for the various algorithms, we are in the process of repeating our experi- 
ments on a variety of machines. We also intend to compute detailed counts 
Table 12. Results for class shop50: No-wait flowshop instances.

                Percent above HK                    Time in Seconds
Alg          100     316    1000    3162       100       316      1000      3162
NN         16.97   14.65   13.29   11.87       .04       .42       6.4        46
i3OPT        .49    4.02    9.23   10.76     14.85     67.91     266.5       940
KP4         1.57    2.92    3.88    4.54      3.12     31.43     306.1      2358
iKP4F        .49    2.36    3.66    4.51    321.70   2853.11   19283.1     74617
RA+         4.77    2.77    1.69    1.05       .15      3.03      74.4      2589
PATCH       1.15     .59     .39     .24       .05       .86      21.7       611
ZHANG1       .20     .08     .03     .01       .09      1.83      50.7      1079
OPT          .05     .02     .01       -     82.03    565.69    2638.2         -
HK           .00     .00     .00     .00      2.29     18.22     269.9      3710
MATCH      -0.50   -0.22   -0.15   -0.07       .05       .85      21.6       590



Table 13. Results for class super: Approximate shortest common superstring instances.

                Percent above HK                    Time in Seconds
Alg          100     316    1000    3162       100       316      1000      3162
NN          8.57    8.98    9.75   10.62       .04       .38       3.9        43
RA+         4.24    5.22    6.59    8.34       .02       .84       5.4       393
PATCH       1.86    2.84    3.99    6.22       .03       .34       4.5        67
KP4         1.05    1.29    1.59    2.10       .05       .44       4.4        45
i3OPT        .28     .61    1.06    1.88       .59      3.19      20.2       138
iKP4F        .13     .15     .28     .52       .26      2.40      26.5       323
ZHANG1       .27     .17     .21     .43       .06      1.10      52.9      2334
OPT          .05     .03     .01       -     10.74    130.40    4822.2         -
HK           .00     .00     .00     .00      1.02      8.14     104.6      2112
MATCH      -1.04   -1.02   -1.17   -1.61       .03       .33       4.3        61



of some of the key operations involved in the algorithms, in hopes of spotting 
correlations between these and running times. 

2. Variants on Zhang’s Algorithm and on Kanellakis-Papadimitriou. 

As mentioned in Section 2, there are several variants on Zhang’s algorithm 
not covered in this report, and we want to study these in more detail in 
hopes of determining how the various algorithmic choices affect the tradeoffs 
between tour quality and running time. A preliminary result along this line 
is that ZHANG2, which patches all viable children, typically does marginally 
better than ZHANG1, although at a cost of doubling or tripling the overall 
running time. Similar issues arise with our local search algorithms, where we 
want to learn more about the impact of the 4-Opt moves in the Kanellakis- 
Papadimitriou algorithm, and the differences in behavior between our two 
versions of sequential search. 
Table 14. Results for realworld test instances.

                                               % Above Opt   Time in Seconds
Class          N  % HK-AP   Symm.  Triangle    iKP4  ZHANG1      iKP4  ZHANG1
rbg358       358      .00   .9517     .5749     .93     .00    105.16     .20
rbg323       323      .00   .9570     .6108     .45     .00    118.84     .17
rbg403       403      .00   .9505     .5511     .12     .00    249.29     .54
rbg443       443      .00   .9507     .5642     .11     .00    310.88     .54
big702       702      .00   .9888    1.0000     .28     .00    394.43     .61
td100.1      101      .00   .9908     .9999     .00     .00     10.93     .01
td316.10     317      .00   .9910     .9998     .00     .00    428.07     .12
td1000.20   1001      .00   .9904     .9994     .00     .00    319.88     .88
dc112        112      .87   .9996    1.0000     .36     .13    175.75     .20
dc126        126     3.78   .9999     .9983     .73     .21    298.72     .25
dc134        134      .36   .9973    1.0000     .54     .18    143.32     .23
dc176        176      .81   .9982    1.0000     .61     .10    279.00     .31
dc188        188     1.22   .9997    1.0000     .70     .22    154.38     .28
dc563        563      .33   .9967    1.0000     .75     .09   1113.92   19.86
dc849        849      .09   .9947    1.0000     .61     .00   1921.31   51.06
dc895        895      .68   .9994    1.0000     .60     .42   1938.23  111.65
dc932        932     2.48   .9999     .9998     .29     .13   1826.41   85.68
code198      198      .09   .7071     .9776     .00     .00     28.76     .03
code253      253    17.50   .7071     .9402     .00     .29     71.92     .30
kro124p      100     5.61   .9992     .9724     .13    3.31       .21     .06
ry48p         48    12.40   .9994     .9849    3.97    2.19       .07     .01
ft53          53    14.11   .8739    1.0000     .74   10.96      1.64     .02
ft70          70     1.75   .9573    1.0000     .07     .47      1.56     .02



3. Other Patching Algorithms. We intend to more completely characterize 
the advantages of Zhang’s algorithm versus the best of the simpler patching 
algorithms by performing more head-to-head comparisons. 

4. Starting Tours for Local Search. This is an important issue. We chose 
to use Nearest Neighbor starting tours as they provided the most robust 
results across all classes. However, for several classes we could have obtained 
much better results for KP4 and iKP4F had we used Greedy or PATCH starts, 
and for those classes where ZHANG1 is best but still leaves some room for 
improvement, it is natural to consider using ZHANG1 as our starting heuristic. 
Preliminary results tell us that there is no universal best starting algorithm, 
and moreover that the ranking of heuristics as to the quality of their tours is 
for many instance classes quite different from their ranking as to the quality 
of the tours produced from them by local optimization. 

5. More, and More Portable, Generators. Our current instance generators 
use a random number generator that does not produce the same results 
on all machines. For the final paper we plan to rerun our experiments on 
new test suites created using a truly portable random number generator. 
This will help confirm the independence of our conclusions from the random 
number generator used and will also provide us with a testbed that can be 
Table 14. (continued)

                                               % Above Opt   Time in Seconds
Class          N  % HK-AP   Symm.  Triangle    iKP4  ZHANG1      iKP4  ZHANG1
ftv33         34     7.85   .9838    1.0000    4.12    3.50       .08     .01
ftv35         36     5.24   .9823    1.0000     .14    1.10       .07     .00
ftv38         39     5.04   .9832    1.0000     .14    1.06       .05     .00
ftv44         45     4.03   .9850    1.0000     .27     .00       .07     .00
ftv47         48     5.53   .9832    1.0000     .13     .68       .11     .01
ftv55         56     9.41   .9853    1.0000     .00     .76       .11     .02
ftv64         65     4.79   .9850    1.0000     .24     .00       .16     .01
ftv70         71     7.49   .9844    1.0000     .00     .00       .14     .03
ftv90         91     5.77   .9849    1.0000     .00     .45       .27     .01
ftv100       101     5.52   .9873    1.0000     .07     .00       .36     .02
ftv110       111     4.18   .9872    1.0000     .54     .00       .51     .04
ftv120       121     4.93   .9877    1.0000     .68     .00      1.10     .05
ftv130       131     3.61   .9884    1.0000     .23     .00       .75     .08
ftv140       141     4.12   .9883    1.0000     .41     .00       .66     .04
ftv150       151     3.17   .9887    1.0000     .67     .00       .72     .04
ftv160       161     3.29   .9890    1.0000     .82     .11       .71     .07
ftv170       171     3.10   .9890    1.0000     .80     .36       .74     .10
br17          17   100.00   .9999     .8474     .00     .00       .06     .01
p43           43    97.36   .7565     .8965     .01     .05      5.15     .02
atex1         16    98.23   .8569    1.0000     .00    6.07       .04     .00
atex3         32    98.37   .9498    1.0000     .05     .41       .32     .02
atex4         48    96.92   .9567    1.0000     .08     .13       .83     .02
atex5         72    97.49   .9629    1.0000     .18     .40      1.51     .06
atex8        600    97.69   .9895    1.0000  < 1.03  < 2.85    351.88   47.43



distributed to others by simply providing the generators, seeds, and other 
input parameters. We also hope to add more instance classes and grow our 
real world test set. 
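As an illustration of what a portable generator for item 5 could look like, the following is a minimal sketch only. It uses the well-known Lehmer "minimal standard" recurrence with exact integer arithmetic, which gives identical output on any machine; it is not the generator used in this study, and the class name MinimalStandard and its methods are hypothetical.

```python
# Sketch of a machine-independent random number generator (assumption:
# the Lehmer "minimal standard" recurrence, not the study's own generator).
MODULUS = 2**31 - 1      # the prime 2147483647
MULTIPLIER = 16807       # 7**5, the classic minimal-standard multiplier


class MinimalStandard:
    """Lehmer generator: state <- MULTIPLIER * state mod MODULUS."""

    def __init__(self, seed):
        if not 0 < seed < MODULUS:
            raise ValueError("seed must lie strictly between 0 and MODULUS")
        self.state = seed

    def next_int(self):
        # Python integers are arbitrary precision, so the product never
        # overflows and the result does not depend on the host word size.
        self.state = (MULTIPLIER * self.state) % MODULUS
        return self.state

    def uniform(self):
        # A float strictly between 0 and 1, derived from the integer state.
        return self.next_int() / MODULUS
```

Distributing a generator of this kind together with the seeds and other input parameters would let others regenerate the same instances.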
Abstract. The Internet is currently composed of thousands of networks 
(autonomous systems) containing several hundred thousand routers. It 
increases in size at an exponential pace and in a decentralized, ad hoc 
manner. This combination of size, rapid growth and decentralization 
poses tremendous challenges to an outside observer attempting to 
understand the internal behavior of the Internet. As the observer will have 
access to only a very small fraction of components in the interior of the 
network, it is crucial to develop methodologies for network tomography, 
i.e., identifying internal network performance characteristics based on 
end-to-end measurements. 

In this talk, we overview the nascent field of network tomography. We 
illustrate problems and issues with examples taken from our MINC (mul- 
ticast inference of network characteristics) project [Q. Briefly, the MINC 
project relies on the use of end-to-end multicast loss and delay measure- 
ments for a sequence of active probes to infer loss and delay character- 
istics on the distribution tree between a sender and a set of receivers. 
Thus, given a set of potential senders and receivers, two ingredients are 
required. The first, statistical in nature, is a methodology for inferring 
the internal characteristics of the distribution tree for a single sender and 
set of receivers. The second, algorithmic in nature, is a methodology for 
choosing an optimal set of trees that can be used to infer the internal 
characteristics of the network. 

Last, we will attempt to identify challenging algorithmic problems 
throughout the talk and the rich set of avenues available for investigating the 
implementation and performance of different network tomography algorithms 
through experiments over the Internet. 
Abstract. This paper investigates the question of what statistical 
information about a memory request sequence is useful to have in making 
page replacement decisions. Our starting point is the Markov Request 
Model for page request sequences. Although the utility of modeling page 
request sequences by the Markov model has been recently put into doubt 
(P3), we find that two previously suggested algorithms (Maximum Hitting 
Time m and Dominating Distribution m) which are based on 
the Markov model work well on the trace data used in this study. Inter- 
estingly, both of these algorithms perform equally well despite the fact 
that the theoretical results for these two algorithms differ dramatically. 
We then develop succinct characteristics of memory access patterns in 
an attempt to approximate the simpler of the two algorithms. Finally, 
we investigate how to collect these characteristics in an online manner 
in order to have a purely online algorithm. 



1 Introduction 

Motivated by increasing gaps at all levels in the hierarchy from CPU speed to 
network I/O speed, researchers have investigated page replacement algorithms 
in the hopes of improving upon the time-honored Least-Recently-Used (LRU). 
LRU is a purely online algorithm in that it requires no information about the 
future requests in order to make a decision about page replacement. Many of the 
recent approaches to page replacement policies have sought to use some extra 
information about the future request sequence in order to guide page replacement 
algorithms. 

The assumption that the application has perfect knowledge about its page 
request behavior has been the starting point of recent work related to reducing 
memory access time. The issue of how to partition memory between different 
applications sharing a common cache has been examined by ^ and Both 
these papers assume that each individual application has perfect knowledge of 
its request sequence and only the interleaving of these request sequences is un- 
known in advance. Cao, Felten, Karlin and Li investigate the question of how 
to integrate page replacement and prefetching decisions so as to optimize total 
memory access time 0 . They also use perfect knowledge of the request sequence 
in their algorithms. 

While offline algorithms are a very reasonable starting point for these investi- 
gations, an important next step is to address the question of how to use imperfect 
information about a request sequence and what are reasonable models for par- 
tial knowledge. A very natural approach to modeling imperfect information is 
to use a statistical model for request access sequences which can be gathered on 
previous executions of the program or in an online fashion as the execution of a 
program progresses. 

The goals of this paper are two-fold: the first set of experiments are designed 
to empirically test algorithms that have been designed using the Markov model 
for paging. The Markov model has been a popular method of modeling paging 
request sequences in recent theoretical work pam. We use trace data to 
simulate the behavior of each algorithm on real paging sequences and report the 
fault rate as the amount of main memory varies. We also give the fault rate of 
the optimal offline algorithm as a point of comparison. Interestingly, although 
recent work m suggests that the Markov Request Model may not be an ac- 
curate model for paging sequences, we find that both algorithms in this study 
which were developed with the Markov model in mind perform near-optimally. 

Unfortunately, these Markov-based algorithms are too complex to implement 
in practice. Therefore, the second goal of this paper is to investigate the question 
of what statistical information about a memory request sequence is useful to have 
in making page replacement decisions. We use the Markov-based algorithms as 
a starting place and develop more practical algorithms which approximate the 
behavior of these algorithms. Again, we use trace data to empirically evaluate 
the algorithms we develop, reporting the number of faults incurred by each 
algorithm as the amount of main memory varies. 

The remainder of Section 1 describes related work and reviews some of the 
results on the Markov model for paging. Section 2 describes the set of trace files 
which we use in this empirical investigation. This paper focuses on the memory 
hierarchy between main memory and disk. Here the unit of memory is a page. It 
would also be very useful to extend this study to examine the memory hierarchy 
between the cache and main memory. In this case, the unit of memory would 
be a cache line. The traces we use are detailed enough for studying cache line 
replacement as well. 

In Section 3 we discuss our implementation of two Markov-based algorithms 
described in Section 1.2 and present our empirical results for these algorithms. 
In Section 4 we develop succinct characteristics of memory request sequences 
which significantly improve paging performance on the set of trace files. The 
characteristics we develop are based on MHT, the simpler of the two algorithms, 
but require maintaining only a small amount of information with each page. 
Finally, in Section 5 we investigate how to collect these characteristics in an 
online manner in order to have a purely online algorithm. 

1.1 Related Work 

A popular recent approach to page replacement policies has been to provide 
mechanisms for the applications programmer to provide information about the 
request sequence using knowledge about the program !i8libli0| . Although these 
schemes can be effective, they place an additional burden on the programmer to 
provide this extra information. Recently, Glass and Cao have devised a scheme 
which switches between LRU and MRU depending on whether the preceding 
references indicate that the program is referencing a long sequence of contiguous 
virtual memory addresses 0. 

The literature of computer performance modeling and analysis contains much 
work on statistical approaches to page replacement, both theoretical and empiri- 
cal. Shedler and Tung m were the first to propose a Markovian model of locality 
of reference. Spirn ca gives a comprehensive survey of models for program be- 
havior, focusing on semi-Markovian methods for generating sequences exhibiting 
locality of reference. Denning [7] (and references therein) develops the working 
set model of program behavior for capturing locality of reference. A recent ap- 
proach similar to the Markov model is the access-graph model in which patterns 
in the reference streams are characterized using a graph-theoretical model |3II2|. 
These methods require more extensive data about the program, although a re- 
cent empirical study of Fiat and Rosen |H] indicates that some of these ideas can 
be successfully simplified. 

Recently, Liberatore has conducted an empirical investigation of the validity 
of the Markov Request Model for paging sequences. His results use several 
different measures to show that the Markov Request Model is not an accurate 
model for paging sequences. His study is based on the traces used in the study 
by Fiat and Rosen 0. We choose not to include those traces in this study. The 
main reason for this choice is that the Fiat-Rosen traces are only composed of the 
subsequence of requests to instruction pages. As a result, the number of distinct 
pages is so small that the total amount of memory accessed in any individual 
trace is never more than 352 kB which would clearly fit into main memory. While 
these traces may have been appropriate for the studies in which they were used, 
we felt that our study was better suited to a set of traces which represent a more 
complete account of the total memory access patterns. 

1.2 A First-Order Markov Model for Request Sequences 

Karlin, Phillips and Raghavan examine the question of how to devise provably 
good page replacement algorithms under the assumption that page request se- 
quences are generated by first-order Markov chains HH. The state space is the 
set of all pages referenced in the course of the program execution. If a memory 
request sequence is generated by a first order Markov chain, the probability that 
a page is referenced at a given point in the request sequence depends only on the 
previous request. Seen another way, given that the reference is to a particular 
page p, the distribution over the next request is determined. 
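To make the first-order assumption concrete, the transition probabilities of such a chain can be estimated from an observed request sequence by counting consecutive pairs. The sketch below is illustrative only and is not taken from the paper; the function name estimate_transitions is hypothetical.

```python
from collections import Counter, defaultdict


def estimate_transitions(requests):
    """Estimate first-order transition probabilities from a request sequence.

    Returns a dict mapping each page p to a dict {q: Pr[next request is q |
    current request is p]}, estimated as relative frequencies of the
    consecutive pairs (p, q) seen in `requests`.
    """
    counts = defaultdict(Counter)
    for prev, nxt in zip(requests, requests[1:]):
        counts[prev][nxt] += 1
    return {
        page: {nxt: c / sum(successors.values())
               for nxt, c in successors.items()}
        for page, successors in counts.items()
    }
```

For a sequence such as a, b, a, c, a, b the estimate for page a assigns 2/3 to b and 1/3 to c, because a is followed by b twice and by c once.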

If a page replacement algorithm knows the Markov chain which generates 
a memory request sequence, it is theoretically possible to determine the opti- 
mal online algorithm which will minimize the expected number of faults El. 
Unfortunately, determining this strategy requires solving a linear program with 
pages, where n is the number of pages referenced in the course of the 
program’s execution and k is the number of pages which fit in the cache. Since 
this is clearly infeasible, they turn to devising an approximation algorithm whose 
expected page fault rate can be proven to be within a constant factor of optimal. 

This paper contains a series of negative results which show that a number 
of intuitively reasonable algorithms do not attain this theoretical goal. Most of 
these algorithms are designed to use the knowledge provided by the Markov 
chain to approximate the behavior of the optimal offline algorithm. The optimal 
offline algorithm, devised by Belady 0, will evict the page whose next reference 
is farthest in the future on a fault. The approximation of the optimal which we 
use here, called Maximum Hitting Time (MHT), determines the expected next 
request of each of the pages in the cache and evicts the page whose expected 
next request time is farthest in the future. Although this algorithm in the worst 
case can be as bad as Ω(k) times the optimal, it proves to be very effective in our 
empirical study. Note that a lower bound entails showing a single Markov chain 
for which the expected fault rate of MHT is Ω(k) times that of the optimal online 
algorithm. It does not necessarily mean that MHT will perform that badly on 
all or even typical Markov sequences. Unfortunately, MHT is not very practical 
since it requires determining, for each pair of pages p and q, the expected time 
to travel in the Markov chain from p to q, which requires O(n^2) space. 
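The offline rule that MHT approximates can be sketched in a few lines. The following is a minimal illustration of Belady's farthest-in-future eviction, assuming the whole request sequence is available as a Python list and the cache holds at least one page; the function name opt_faults is hypothetical. MHT would replace the exact next-request index used here by an expected hitting time.

```python
def opt_faults(requests, cache_size):
    """Count faults of the optimal offline (farthest-in-future) policy."""
    # next_use[i] = index of the next request to requests[i] after i,
    # or infinity if the page is never requested again.
    next_use = [0] * len(requests)
    last_seen = {}
    for i in range(len(requests) - 1, -1, -1):
        next_use[i] = last_seen.get(requests[i], float("inf"))
        last_seen[requests[i]] = i

    cache = {}   # page -> index of its next request
    faults = 0
    for i, page in enumerate(requests):
        if page not in cache:
            faults += 1
            if len(cache) >= cache_size:
                # Evict the cached page whose next request is farthest away.
                victim = max(cache, key=cache.get)
                del cache[victim]
        cache[page] = next_use[i]
    return faults
```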

A subsequent paper by Lund, Phillips, and Reingold m addresses more 
general distributions over memory request sequences. They show an algorithm 
called the dominating distribution algorithm (DD) which is conceptually very 
elegant. The algorithm is provably close to optimal in its expected fault rate, 
but impractical to implement. At each point in time that a fault occurs, the 
algorithm determines for each pair of pages p and q in memory, the probability 
that p is requested before q. Given these values, it is possible to determine 
what is called a dominating distribution over the set of pages in memory. A 
dominating distribution has the property that if a page r is chosen according to 
this distribution, then for any other page in memory p, the probability that p is 
requested before r is at most 1/2. They prove that when the page to be evicted is 
chosen according to the dominating distribution, the expected number of faults 
is within a factor of four of optimal. The algorithm is impractical for two reasons: 
if the request sequence is generated by a Markov chain, then for each ordered 
triplet of pages q, p and r, we must have determined the probability that r is 
requested before p given that the last request was to page q. This is an O(n^3) amount 
of data to maintain. The other reason for the impracticality of the algorithm is 
that determining the dominating distribution requires solving a linear 
program in k variables. Furthermore, this linear program must be solved at every 
page fault. 

Despite the fact that recent work indicates that first-order Markov chains 
may not model page request sequences very accurately (ISl), the two algorithms 
described here work very well on the traces used in this study. Although these 
algorithms are not practical page replacement algorithms, they give important 
intuition for developing more feasible algorithms. 



2 Trace Methodology 

We use I/O traces collected from a number of applications by Cao, Felten, and Li 
0. They instrumented the Ultrix 4.3 kernel to collect the traces. When tracing 
is turned on, file I/O system calls for every process are recorded in a log. They 
gathered traces for a number of applications. The applications are 

dinero           dinero simulating about 8 MB of cc trace.
cscope           cscope searching for eight C symbol names in about 18 MB of kernel source.
glimpse          glimpse searching for 5 sets of keywords in about 40 MB of newsgroup text.
ld               ld building the Ultrix 4.3 kernel from about 25 MB of object files.
sort             sorting numerically on the first field of a 200,000 line, 17 MB text file.
postgres join    postgres joining 200,000 tuple and 20,000 tuple relational databases.
postgres select  postgres doing a selection.
rendering        3-D volume rendering software rendering 22 slices with stride 10.

For more detailed information on these applications, see 0. 

The table below shows the number of distinct pages requested and the length 
of the sequence for each of the traces. 



Trace                 Pages   Requests
dinero                 1972      17715
cscope                 2143      17255
glimpse               10310      49828
ld                     6583      10383
sort                  16578      36938
postgres join          8781      27170
postgres selection     7265      14410
volume rendering      11007      46703



We used a uniform page size of 4 kB throughout our study. For read and write 
system calls, the device and inode numbers of the file accessed and the integer 
part of the offset in the file divided by 4000 is stored in the linked list. If the 
count in the read or write extends over multiple pages, each of those pages is 
stored in the linked list. Consecutive reads and writes to the same page are 
stored as one entry in the linked list. 
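The preprocessing just described can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: each logged call is assumed to be a tuple (device, inode, offset, count), a Python list stands in for the linked list, and the names pages_touched and build_request_list are hypothetical. The divisor 4000 is the value quoted in the text.

```python
PAGE_DIVISOR = 4000  # divisor quoted in the text for mapping offsets to pages


def pages_touched(device, inode, offset, count, divisor=PAGE_DIVISOR):
    """Return the page identifiers covered by one read or write call."""
    first = offset // divisor
    # A call of `count` bytes ends at byte offset + count - 1; treat a
    # zero-length call as touching the page that contains `offset`.
    last = (offset + max(count, 1) - 1) // divisor
    return [(device, inode, n) for n in range(first, last + 1)]


def build_request_list(calls, divisor=PAGE_DIVISOR):
    """Turn logged I/O calls into a page request sequence.

    Consecutive requests to the same page are stored as a single entry.
    """
    requests = []
    for device, inode, offset, count in calls:
        for page in pages_touched(device, inode, offset, count, divisor):
            if not requests or requests[-1] != page:
                requests.append(page)
    return requests
```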

Each of our programs reads in a trace file and stores the relevant page infor- 
mation in a linked list. When time information is necessary, as in LRU and the 
optimal offline algorithm, a counter is incremented for each page in the linked 
list. Each algorithm can then be run on various cache sizes with the page re- 
quests in the linked list. Of course, the online algorithms do not “remember” 
anything from one cache size to another. 



2.1 Experimental Design 

In each of the studies for this paper, we evaluate different algorithms measuring 
the number of faults they incur on a given request sequence. The amount of main 
memory available is varied, to test the algorithms under different conditions. For 
simplicity, the amount of main memory is measured in terms of the number of 
memory pages it can hold. Naturally, as this number increases, all the algorithms 
improve. In each graph, we give the fault rate of the optimal offline algorithm 
for comparison. 

The fault rate versus memory size is represented as a continuous curve, although 
it is really the result of many discrete trials. In Figures 1 and 2 each 
line is composed of roughly one hundred discrete points. The amount of main 
memory is increased linearly from one page to the maximum number used. For 
the Dominating Distribution algorithm, we only had computational resources to 
test the algorithm for a small number of main memory sizes on a few traces. This 
data is represented by discrete points, instead of a continuous curve. In Figures 
3, 4, 5 and 6 the size of main memory is increased by 100 pages in each trial. 

In all the graphs, when the fault rate for a particular algorithm reaches the 
total number of distinct pages requested in the sequence, we do not perform any 
further trials with more main memory. Any algorithm must incur a fault the 
first time a page is requested, so increasing the amount of main memory beyond 
this value would not decrease the fault rate. Thus, you will notice that the curve 
that corresponds to the optimal offline algorithm is always the shortest curve 
since it reaches the minimum number of faults first. 
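The experimental loop described in this section can be sketched as below, using LRU as the algorithm under test. It is an illustration only: the names lru_faults and fault_curve are hypothetical, and the step parameter stands in for the memory increments mentioned above. The sweep stops as soon as the number of faults equals the number of distinct pages.

```python
from collections import OrderedDict


def lru_faults(requests, cache_size):
    """Count the faults LRU incurs with room for `cache_size` pages."""
    cache = OrderedDict()          # least recently used page comes first
    faults = 0
    for page in requests:
        if page in cache:
            cache.move_to_end(page)        # mark as most recently used
        else:
            faults += 1
            if len(cache) >= cache_size:
                cache.popitem(last=False)  # evict least recently used
            cache[page] = True
    return faults


def fault_curve(requests, step=1):
    """Return (memory size, faults) pairs for growing memory sizes."""
    distinct = len(set(requests))
    curve = []
    size = 1
    while True:
        faults = lru_faults(requests, size)
        curve.append((size, faults))
        if faults == distinct:   # only first-reference faults remain
            break
        size += step
    return curve
```

Because every algorithm must fault on the first request to each page, the loop always terminates once the memory size reaches the number of distinct pages.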



3 Markov-Based Algorithms 

3.1 Gathering Statistics from the Traces 

The Maximum Hitting Time algorithm requires keeping for each ordered pair of 
pages (p, q), the expected number of requests after a request to p until a request 
to q. Since this information is expensive to derive from the trace exactly (in 
terms of available memory, see Section 3.2), we made some approximations in 
our preprocessing. A counter is kept for every ordered pair (p,q). This counter 
is always in one of two states: active or inactive. The counter starts out inactive. 
If a request to p is encountered, the counter is put in the active state if it is not 
[Plots; axis label: Number of Faults]

Fig. 1. Comparison of Markov-based algorithms with LRU and the optimal offline 
algorithm. The page size is 4 kB. In the dinero, cscope and ld traces, OPT and Maximum 
Hitting Time practically coincide. See Sections 2.1 and 3 for more details on these 
experiments. 

[Plots; axis label: Number of Faults]

Fig. 2. Comparison of Markov-based algorithms with LRU and the optimal offline 
algorithm. The page size is 4 kB. See Sections 2.1 and 3 for more details on these 
experiments. 

[Plots; axis label: Number of Faults]

Fig. 3. Comparison of algorithms which use some partial offline information. The 
page size is 4 kB. LRU and Augmented LRU practically coincide in all the traces. In 
the dinero trace, OPT, Avg Round Trip, and Conditional Avg Round Trip practically 
coincide. See Sections 2.1 and 4 for more details on these experiments. 

Fig. 4. Comparison of algorithms which use some partial offline information. The page 
size is 4 kB. See Sections 2.1 and 4 for more details on these experiments. 

Fig. 5. Comparison of online algorithms with LRU and the optimal offline algorithm. 
The page size is 4 kB. See Sections 2.1 and 5 for more details on these experiments. 

Fig. 6. Comparison of online algorithms with LRU and the optimal offline algorithm. 
The page size is 4 kB. See Sections 2.1 and 5 for more details on these experiments. 

already in the active state. Whenever a request to q is encountered, the counter 
is put in the inactive state if it is not already in the inactive state. While the 
counter is active, it is incremented once for every request in the sequence. The 
value of the counter is taken to be the total length of all hitting times from p 
to q. This value is divided by the number of times a request to p is followed by 
a request to q without an intervening request to p. Naturally, it is possible that 
there is never a request to p followed by a request to q. This case is noted by a 
flag indicating an infinite hitting time from p to q. Consider the example below: 

x x p [x x q] x x x p [x p x x x q] x p [x x p x] 

The x characters represent requests to pages other than p or q. The characters in 
brackets indicate the requests that arrive while the counter for (p, q) is active. 
The value of the counter at the end of the sequence is 13, which is the number of 
requests which arrive while the counter is active. This value is divided by two to 
get the expected hitting time from p to q. Note that the divisor is the number of 
times the counter is switched from the active to the inactive state, which is the 
same as the number of times a request to p is followed by a request to q without 
an intervening request to p. 
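
The counter for a pair (p, q) can be sketched in a few lines of Python. This is an 
illustrative sketch, not the code used in the experiments; the function name 
expected_hitting_time and its return convention are assumptions. As in the example 
above, requests that arrive during a trailing active run are still counted. 

```python
def expected_hitting_time(requests, p, q):
    """Estimate the expected hitting time from page p to page q in one pass."""
    active = False    # True between a request to p and the next request to q
    total = 0         # number of requests that arrive while the counter is active
    completed = 0     # number of switches from the active to the inactive state
    for page in requests:
        if active:
            total += 1            # counted once for every request while active
        if page == p:
            active = True         # no effect if the counter is already active
        elif page == q and active:
            active = False
            completed += 1
    # An infinite hitting time flags the case where p is never followed by q.
    estimate = total / completed if completed else float("inf")
    return estimate, total, completed
```

On the example sequence, written as the string "xxpxxqxxxpxpxxxqxpxxpx", this returns 
a counter value of 13, two completed runs, and an estimate of 6.5. 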

For the Dominating distribution algorithm, we need to know for every triple 
of pages (p,q,r), the probability that r is requested before q after a request to 
p. A counter is maintained for this triple which can either be active or inactive. 
The counter is maintained as follows: 

— If a request to p is encountered, it is put in the active state if it is not already 
in the active state. 

— If a request to r is encountered and the counter is active, then the counter 
is incremented and put in the inactive state. 

— If a request to q is encountered, the counter is put in the inactive state if it is 
not already in the inactive state. 

The value of the counter is then divided by the number of times a request to p 
is followed by a request to q or r without an intervening request to p. Consider 
the example below: 

x x p [x x r] x x x p [x q] x p [x p x x x r] x p [x x p x] 

The x characters represent requests to pages other than p, q and r. The characters 
in brackets represent the requests which come in while the counter for (p, q, r) is 
active. At the end of the sequence, the counter is two, since two of the active 
subsequences end in an r. This is then divided by three to get the probability 
that r is requested before q after a request to p. Again, the divisor represents 
the number of times the counter is switched from the active to the inactive state, 
which is the same as the number of times a request to p is followed by a request 
to r or q without an intervening request to p. 
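
The counter for a triple (p, q, r) can be sketched in the same way. Again this is 
only an illustration under our reading of the three rules above; the function name 
prob_r_before_q is hypothetical, and returning None when no active run ever ends is 
an assumption. 

```python
def prob_r_before_q(requests, p, q, r):
    """Estimate the probability that r is requested before q after a request to p."""
    active = False
    r_first = 0     # active runs that ended with a request to r (the counter)
    ended = 0       # active runs that ended with a request to q or r (the divisor)
    for page in requests:
        if page == p:
            active = True                 # no effect if already active
        elif active and page in (q, r):
            if page == r:
                r_first += 1
            ended += 1
            active = False
    probability = r_first / ended if ended else None
    return probability, r_first, ended
```

On the example sequence "xxpxxrxxxpxqxpxpxxxrxpxxpx" the counter ends at two with 
three completed runs, which gives the estimate 2/3. 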
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3.2 Implementation 

Simulating MHT and DD is problematic because of both their space and their 
computational requirements. Fortunately, the number of pages was small enough that 
each page could be identified by a short integer even for the largest page set. 
This enabled us to maintain the entire O(n^2) data structure for MHT. However, 
the O(n^3) space required by DD was still insurmountable. In order to reduce 
the space required by DD to O(n^2), we had to reprocess the trace data for each 
page fault. Reprocessing the trace data and solving a linear program at every 
page fault made the simulation of DD a very time-consuming process. We only 
simulated a few cache sizes for a few traces to spot check DD against the other 
algorithms. 

3.3 Experimental Results 

Figures 1 and 2 show the results of running four page replacement algorithms on 
the traces. These are: LRU, Belady’s optimal offline algorithm (OPT), Maximum 
Hitting Time (MHT) and Dominating Distribution (DD). As you can see from the 
figures, MHT does very well on all of these traces, and DD does about as well 
as OPT and MHT for the points that we simulated. This is despite the fact that 
DD requires O(n^3) data and solving a linear program at every page fault while 
MHT only requires O(n^2) data and a linear search at every page fault. This 
suggests that the extra resources used by DD to improve the page fault rate are 
unnecessary for the programs used in these experiments. 

4 A More Practical Algorithm 

Motivated by the theoretical results on Markov paging, we develop page replace- 
ment policies which are more efficient, both in their time and space requirements. 
The algorithms described here are closest in spirit to the Maximum Hitting Time 
algorithm described above in Section 3. As in MHT, the goal of the algorithm is to 
evict the page whose expected next request time is farthest in the future. However, 
instead of basing this expectation on the last request seen, we make a cruder 
estimate which requires only O(1) space per page. We determine the average 
round-trip time for each page which is the average number of requests in between 
consecutive requests to that page. For each page in the cache, we estimate its 
next request time by the last time it was requested plus the expected round-trip 
time for that page. This algorithm is called Average Round-Trip Time (ARTT). 

In an offline pass of the data, we determine the average round trip time 
for each page. Pages which are only requested once are given an arbitrarily 
large round-trip time. Then when the cache is simulated, every time a page is 
requested, its estimated next request time is set to be the current time plus its 
average round-trip time. On a fault, the page with the largest estimated next 
request time is evicted. In order to prevent pages from staying in memory for a 
very long time, if a page’s estimated next request time has passed, its priority 
for eviction is set arbitrarily high. 

The algorithm is relatively simple to implement. Each page has an average 
round-trip time, which it keeps throughout the execution of the program. This 
information is static. The pages in memory are kept in a priority queue according 
to their estimated next request time. One update to the priority queue must be 
performed at each memory request (although updating the priority queue could 
be done less frequently). A separate priority queue is maintained which keeps 
the pages in reverse order to detect when a page’s next request time has expired. 
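
The following Python sketch illustrates ARTT on a small trace. It is not the 
simulator used for the experiments: to stay short it scans the cache linearly 
instead of maintaining the two priority queues, the function names are 
hypothetical, and we assume that a round-trip time is the difference between the 
positions of two consecutive requests to the same page. 

```python
def average_round_trip_times(trace, big=float("inf")):
    """Offline pass: average gap between consecutive requests to each page."""
    last_seen, total, count = {}, {}, {}
    for t, page in enumerate(trace):
        if page in last_seen:
            total[page] = total.get(page, 0) + (t - last_seen[page])
            count[page] = count.get(page, 0) + 1
        last_seen[page] = t
    # Pages requested only once receive an arbitrarily large round-trip time.
    return {pg: (total[pg] / count[pg] if pg in count else big) for pg in last_seen}


def simulate_artt(trace, k, rtt):
    """Count page faults of ARTT with a cache of k pages."""
    cache = {}      # page -> estimated next request time
    faults = 0
    for t, page in enumerate(trace):
        if page not in cache:
            faults += 1
            if len(cache) >= k:
                # Pages whose estimate has already passed are evicted first;
                # otherwise evict the largest estimated next request time.
                victim = max(cache, key=lambda pg: (cache[pg] < t, cache[pg]))
                del cache[victim]
        cache[page] = t + rtt[page]
    return faults
```

For the trace a b a c a b, the offline pass gives round-trip times of 2 for a and 4 
for b, and an arbitrarily large value for c. 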

A second version of this algorithm is based on the observation that typically 
a page is requested with relatively short round-trip times and then may not be 
requested for a long time. Presumably, we want to use the short round-trip time 
for the estimated next request time for the page and when this time passes, mark 
the page for eviction. Thus, we have the Conditional Average Round-Trip Time 
(CARTT) algorithm which is the same as ARTT except that it only takes the 
average of the round-trip times which are shorter than a predefined threshold. 
The threshold we use here is 3k, where k is the number of pages which can fit in 
memory. 
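
The only change relative to ARTT is in the offline pass, which can be sketched as 
follows. The function name and the factor parameter are hypothetical, and giving 
an arbitrarily large value to pages that never have a short round trip is our 
assumption; the text only specifies this for pages requested once. 

```python
def conditional_average_rtt(trace, k, factor=3, big=float("inf")):
    """Average only the round-trip times shorter than factor * k."""
    threshold = factor * k
    last_seen, total, count = {}, {}, {}
    for t, page in enumerate(trace):
        if page in last_seen:
            gap = t - last_seen[page]
            if gap < threshold:           # long round trips are left out
                total[page] = total.get(page, 0) + gap
                count[page] = count.get(page, 0) + 1
        last_seen[page] = t
    return {pg: (total[pg] / count[pg] if pg in count else big) for pg in last_seen}
```

For a page requested at times 0, 1, 2 and 10 with k = 2, the round trip of length 8 
exceeds the threshold of 6, so the conditional average is 1 rather than 10/3. 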

The three is an arbitrarily chosen small constant. Varying this value slightly 
does not significantly alter the results. The factor of k comes from the reasoning 
that under LRU, a page will sit in memory for a length of time that depends on 
the size of the cache. The fact that round-trip times have somewhat bimodal 
distributions indicates that the choice of threshold is not so important. It does 
begin to matter as the size of the cache approaches one of the humps in the 
distribution, which would explain the occasionally erratic behavior in CARTT 
seen below. 

We compared these two algorithms to the offline optimal and LRU as well 
as to a slightly altered version of LRU which we call Augmented-LRU. In 
Augmented-LRU, we give the algorithm a single bit of offline information per page 
that tells whether the page ever has a round-trip time of less than three times the 
capacity of the cache. This is to determine whether the advantage that ARTT 
and CARTT gain is from this bit of information or from the more refined 
data given by the round-trip times. The results are shown in Figures 3 and 4. 
Although it is hard to tell in some of the figures, Augmented-LRU and 
LRU often behave exactly the same and worse than CARTT and ARTT. This 
indicates that the round-trip time values used by CARTT and ARTT do provide 
valuable information. As expected, CARTT outperforms ARTT in almost all 
cases. Both algorithms obtain significant gains over LRU on most of the traces 
and are never significantly worse. 



5 Purely Online Algorithms 

We use the intuition from the partially offline algorithms discussed in Section 
4 to develop purely online algorithms which have no prior knowledge of future 
requests except what they have seen in the past. The algorithms we describe 
here attempt to behave like CARTT except that they are forced to approximate 
the conditional average round-trip time for each page based on what they have 
seen so far in the sequence. Each page maintains the sum of the round-trip times 
under 3k which have occurred so far and the number of these “small” round 
trips. These values are updated whenever the page is requested. Whenever the 
average round-trip time is needed, the ratio of these two values is used. 

For the first three requests to a page, we use a default value for the 
conditional round-trip time until more data has been gathered for that page. The 
algorithms only differ in what is used for the default value. We present data 
from two algorithms below. The first one uses a default of k. We called this one 
online:default-k. The second uses a dynamic default value which is the number 
of requests since the last request to that page. The goal here is to get the 
algorithm to behave in an LRU-like manner until enough statistical information 
is available to improve page replacement performance. If the time since the last 
request is always used as the estimated round-trip time, then the algorithm 
behaves exactly as LRU. We called this algorithm online:default-LRU. We also 
tested an algorithm which set the default to a very high value, which would lead 
to quickly evicting pages which had not been requested more than three times. 
This algorithm did much worse than the other two. 
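
The per-page bookkeeping of the two online variants can be sketched as below. The 
class and method names are hypothetical, and two details are assumptions on our 
part: the default is also used when a page has not yet had any “small” round trip, 
and the sketch covers only the estimate, not the eviction rule. 

```python
class OnlineRoundTripEstimator:
    """Running conditional average round-trip time with a default for new pages."""

    def __init__(self, k, default="k", factor=3, warmup=3):
        self.k = k
        self.threshold = factor * k     # round trips under 3k count as "small"
        self.default = default          # "k" or "lru"
        self.warmup = warmup            # requests answered with the default value
        self.last = {}                  # page -> time of its last request
        self.requests = {}              # page -> number of requests seen
        self.small_sum = {}             # page -> sum of small round-trip times
        self.small_count = {}           # page -> number of small round trips

    def record(self, page, now):
        """Update the statistics of a page when it is requested at time now."""
        if page in self.last:
            gap = now - self.last[page]
            if gap < self.threshold:
                self.small_sum[page] = self.small_sum.get(page, 0) + gap
                self.small_count[page] = self.small_count.get(page, 0) + 1
        self.last[page] = now
        self.requests[page] = self.requests.get(page, 0) + 1

    def estimate(self, page, now):
        """Estimated round-trip time of a previously requested page."""
        trips = self.small_count.get(page, 0)
        if self.requests.get(page, 0) > self.warmup and trips > 0:
            return self.small_sum[page] / trips
        if self.default == "k":
            return self.k                   # online:default-k
        return now - self.last[page]        # online:default-LRU
```

With k = 4, a page requested at times 0, 2 and 4 still receives the default estimate 
of 4; after a fourth request at time 6 its estimate becomes the running average 2. 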

The results are shown in Figures 5 and 6. Surprisingly, the default-k 
algorithm was almost always the better of the two. One reason for this is that 
the relative values of the default round-trip times and the estimated round-trip 
time may have been off so that the algorithm unfairly favored one set over the 
other. This perhaps could be adjusted by weighting the two values differently. 
The online:default-k algorithm improved upon LRU on most of the traces. 
Interestingly, when the default-k algorithm was significantly worse than LRU, the 
default-LRU algorithm was better than LRU. 
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Abstract. Search engines often employ techniques for determining syn- 
tactic similarity of Web pages. Such a tool allows them to avoid returning 
multiple copies of essentially the same page when a user makes a query. 
Here we describe our experience extending these techniques to MIDI mu- 
sic files. The music domain requires modification to cope with problems 
introduced in the musical setting, such as polyphony. Our experience 
suggests that when used properly these techniques prove useful for de- 
termining duplicates and clustering databases in the musical setting as 
well. 



1 Introduction 

The extension of digital libraries into new domains such as music requires re- 
thinking techniques designed for text to determine if they can be appropri- 
ately extended. As an example of recent work in the area, Francu and Nevill- 
Manning describe designing an inverted index structure for indexing and per- 
forming queries on MIDI music files jS|. Their results suggest that by using 
additional techniques such as pitch tracking, quantization, and well-designed 
musical distance functions, organizing and querying a large music database can 
be made efficient. 

In this paper, we describe our experience extending hashing-based techniques 
designed for finding near-duplicate HTML Web documents to the problem of 
finding near-duplicate MIDI music files. These techniques are currently used by 
the AltaVista search engine to, for example, determine if an HTML page to be 
indexed is nearly identical to another page already in the index (^; they have also 
been used to cluster Web documents according to similarity 0. Our goal is to 
determine if these techniques can be effective in the setting of musical databases. 
This follows a trend in other recent work in the area of musical databases in 
trying to extend text-based techniques to musical settings ilS|. 

As an example of related work, the Humdrum Toolkit provides a suite of 
tools that may be used to estimate musical resemblance and manipulate musical 
structures in other ways. Indeed, the techniques we describe could be imple- 
mented in Humdrum, and we believe they may prove a useful addition to the 
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toolkit. We note that Humdrum uses its own proprietary format (although trans- 
lations to and from MIDI are possible). Themefinder 0 implements a musical 
database search tool using the same format as Humdrum and allows searches on 
a melody line. We instead allow comparison between entire pieces of music. 

Our results suggest that finding near-duplicates or similar MIDI pieces can be 
done efficiently using these hashing techniques, after introducing modifications to 
cope with the additional challenges imposed by the musical setting. In particular 
we address the problem of polyphony, whereas other work on musical similarity 
has largely restricted itself to single, monophonic melody lines. Applications for 
these techniques are similar to those for Web pages, as described in |Z]. We 
focus on the most natural applications of clustering and on-the-fly resemblance 
computation. 

To begin, we review important aspects of the MIDI file format and recall the 
basic framework of these hashing techniques in the context of text documents. 
We then discuss the difficulties in transferring this approach to the musical 
domain and suggest approaches for mitigating these difficulties. We conclude 
with results from our current implementation. 



1.1 Review: MIDI 

The MIDI (Musical Instrument Digital Interface) 1.0 specification [5] defines a 
protocol for describing music as a series of discrete note-on and note-off events. 
Other information, such as the force with which a note is released, volume 
changes, and vibrato can also be described by the protocol. A MIDI file pro- 
vides a sequence of events, where each event is preceded by timing information, 
describing when the event occurs relative to the previous event. MIDI files there- 
fore describe music at the level that sheet music does. Because of its simplicity 
and extensibility, MIDI has become a popular standard for communication of 
musical data, especially classical music. 



1.2 Review: Hashing Techniques for Text Similarity 

We review previously known techniques based on hashing for determining when 
text documents are syntactically similar. Here we follow the work of Broder |0|, 
although there are other similar treatments j9l III) . 

Each document may be considered as a sequence of words in a canonical 
form (stripped of formatting, HTML, capitalization, punctuation, etc.). A con- 
tiguous subsequence of words is called a shingle, and specifically a contiguous 
subsequence of w words is a w-shingle. For a document A and a fixed w there 
is a set S(A, w) corresponding to all w-shingles in the document. For example, 
the set of 4-shingles S(A, 4) of the phrase “one two three one two three one two 
three” is: 

{(one, two, three, one), (two, three, one, two), (three, one, two, three)} 

(We could include multiplicity, but here we just view the shingles as a set.) 

One measure of the resemblance of two text files A and B is the resemblance of 
their corresponding sets of shingles. We therefore define the resemblance r(A, B) 
as: 

\[ r(A, B) = \frac{|S(A, w) \cap S(B, w)|}{|S(A, w) \cup S(B, w)|} \]



The resemblance r(A, B) is implicitly dependent on w, which is generally a pre- 
chosen fixed parameter. The resemblance is a number between 0 and 1, with 
a value of 1 meaning that the two documents have the same set of w-shingles. 
Small changes in a large document can only affect the resemblance slightly, since 
each word change can affect at most w distinct shingles. Similarly, resemblance 
is resilient to changes such as swapping the order of paragraphs. 
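
As a small illustration of these definitions, the following Python sketch builds the 
set of w-shingles of a text and computes the exact resemblance of two texts. The 
function names are hypothetical, and the canonical form is reduced to lowercasing 
and splitting on whitespace, which is a simplification of the preprocessing 
described above. 

```python
def shingles(text, w):
    """Return the set of w-shingles (tuples of w consecutive words) of a text."""
    words = text.lower().split()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}


def resemblance(text_a, text_b, w):
    """Size of the intersection of the shingle sets divided by the size of the union."""
    sa, sb = shingles(text_a, w), shingles(text_b, w)
    union = sa | sb
    # Two texts too short to contain any shingle are treated as identical here.
    return len(sa & sb) / len(union) if union else 1.0
```

For the phrase used in the example, shingles(..., 4) contains exactly the three 
4-shingles listed above. The texts “a b c d e” and “a b c d f” share one of their 
three distinct 4-shingles, so their resemblance is 1/3. 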

An advantage of this definition of resemblance is that it is easily approxi- 
mated via hashing (or fingerprinting) techniques. We may hash each w-shingle 
into a number with a fixed number of bits, using for example Rabin’s finger- 
printing function p|. From now on, when using the term shingle, we refer to the 
hashed value derived from the underlying words. For each document, we store 
only shingles that are 0 modulo p for some suitable prime p. Let L(A) be the set 
of shingles that are 0 modulo p for the document A. Then the estimated value 
of the resemblance r_e is given by: 

\[ r_e(A, B) = \frac{|L(A) \cap L(B)|}{|L(A) \cup L(B)|} \]



This is an unbiased estimator for the actual resemblance r(A, B). By choosing 
p appropriately, we can reduce the amount of storage for L(A), at the expense 
of obtaining possibly less accurate estimates of the resemblance. As L(A) is a 
smaller set of shingles derived from the original set, we call it a sketch of the 
document A. 

Similarly, we may define the containment of A by B, or c(A, B), by: 

\[ c(A, B) = \frac{|S(A, w) \cap S(B, w)|}{|S(A, w)|} \]



Again, containment is a value between 0 and 1, with value near 1 meaning that 
most of the shingles of A are also shingles of B. In the text setting, a containment 
score near 1 suggests that the text of A is somewhere contained in the text of 
B. We may estimate containment by: 

\[ c_e(A, B) = \frac{|L(A) \cap L(B)|}{|L(A)|} \]



Again this is an unbiased estimator. 
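
The sketch-based estimates can be illustrated as follows. This is a sketch under 
stated assumptions: zlib.crc32 stands in for Rabin’s fingerprinting function, the 
function names are hypothetical, and returning None for an empty sketch is our 
choice, since the estimate is undefined in that case. 

```python
import zlib


def sketch(words, w, p):
    """Hash every w-shingle of a word list and keep the hashes that are 0 modulo p."""
    hashes = {
        zlib.crc32(" ".join(words[i:i + w]).encode("utf-8"))
        for i in range(len(words) - w + 1)
    }
    return {h for h in hashes if h % p == 0}


def estimated_resemblance(sketch_a, sketch_b):
    """Intersection over union of two sketches."""
    union = sketch_a | sketch_b
    return len(sketch_a & sketch_b) / len(union) if union else None


def estimated_containment(sketch_a, sketch_b):
    """Fraction of the first sketch that also appears in the second."""
    return len(sketch_a & sketch_b) / len(sketch_a) if sketch_a else None
```

With p = 1 every shingle is kept, so the estimates coincide with the exact values 
(barring hash collisions); larger values of p keep roughly a 1/p fraction of the 
shingles. 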



2 Challenges in Adapting Hashing Techniques to Music 

2.1 Polyphony 

The greatest challenge in adapting hashing techniques similar to those for text 
described above is that while text is naturally represented as a sequence of bytes, 
musical notes occur logically in parallel, not serially. Thus musical data formats 
like MIDI must arbitrarily flatten this data into a serial stream of note events. 
This creates several potential problems related to polyphony, or the playing of 
several notes concurrently, so that directly converting the MIDI representation 
into a text file and applying the above techniques is not sufficient. Much of 
the work in musical resemblance has obviated this problem by considering only 
monophonic sequences of notes like a simple melody. We attempt to deal with 
music in its entirety. 




Fig. 1. Excerpt from J.S. Bach’s “Jesu, Joy of Man’s Desiring” 



For example, Figure 1 shows three measures of J.S. Bach’s “Jesu, Joy of 
Man’s Desiring” in roll-bar notation. A sequence of note events representing this 
excerpt might begin as follows: 

G1 On, G2 On, G3 On, G3 Off, A3 On, G1 Off, G2 Off, A3 Off, D2 On, G2 On ... 

The events representing the bass notes (G1, G2, and D2) are interspersed 
among the melody note events, although intuitively the two are logically separate 
sequences. As this example demonstrates, a logical group of notes such as the 
melody usually is not represented contiguously in a MIDI file, because note 
events occurring in other parts come in between. 

This situation confounds attempts to derive meaningful resemblance esti- 
mates from direct application of text resemblance techniques. For example, if 
one used the natural text representation of a MIDI file to compute the resem- 
blance or containment of some MIDI file and another MIDI file containing only 
the melody of the same piece, one could not expect to correctly identify the 
similarity between the files. Adjacent events in a MIDI file may not have any 
logical relationship, so grouping them into shingles is no longer meaningful in 
the context of a MIDI file. 

2.2 Timing and Other Information 

Since MIDI records event timing with fine resolution, the timing of note events in 
two MIDI representations of the same music could differ, but to an imperceptible 
degree. Such insignificant variations can potentially have a significant impact on 
a resemblance estimate. MIDI files may also record a variety of musical meta- 
information and machine-specific controls not directly related to the music. 



3 Solutions 

3.1 Pitch by Pitch 

Our basic strategy for adapting to musical data is to separate out notes from 
each pitch, and fingerprint the sequences of start times for notes of each pitch 
independently. That is, a separate sketch is produced for each of the 128 possible 
note pitches in a MIDI file. Contiguous subsequences of note start times in each 
pitch are grouped into shingles, and so forth as in the text resemblance compu- 
tation described above. So, the resemblance of C3 note events in document A 
with C3 note events in document B is computed by direct application of the text 
resemblance computation, and likewise for all 128 possible pitches. A weighted 
average of these 128 resemblance computations then gives the resemblance be- 
tween A and B. 

This helps mitigate the problems of polyphony in both resemblance and 
containment computations. The notes in one pitch are more likely to belong to 
one logical group of notes, therefore it is reasonable to group notes of the same 
pitch together. At the level of a single pitch, the text resemblance computation 
is again meaningful and we can take advantage of this established technique. We 
may eliminate the problem of arbitrary ordering of simultaneous events; when 
considering one pitch, simultaneous events are exact duplicates, and one can be 
ignored. 

One may ask why we do not group notes from two or three adjacent pitches 
together, or group notes across all 128 pitches together into a single sequence. 
Our initial experimental results indicate that this hurts performance, and indeed 
grouping all pitches together yields extremely poor performance. While this ap- 
proach may be worth further study, we believe the advantages of the pitch by 
pitch approach in the face of polyphony are quite strong. 

3.2 Extension to Transpositions and Inversions 

The pitch by pitch approach also allows us to consider musical transpositions. 
Consider a musical piece in the key of C in MIDI file A, and the same piece in 
the key of C# in file B. As given, this resemblance computation would return 
a very low resemblance between A and B, though musically they are all but 
identical. C3 notes in A are compared to C3 notes in B, but this group of notes 
really corresponds to C#3 notes in B. 

If we were to account for this by comparing C#3 notes in B to C3 notes 
in A, and so forth, we would correctly get a resemblance of 1.0. This is trivial 
using pitch by pitch sketches; we may try all possible transpositions and take 
the maximum resemblance as the resemblance between two MIDI files. While 
this certainly solves the problem, it slows the computation substantially. (There 
are up to 128 transpositions to try.) In practice we may try the few most likely 
transpositions, using information such as the number of shingles per pitch to 
determine the most likely transpositions. 

Grouping by pitch into 128 sequences is not the only possibility. For example, 
another possibility is to ignore the octave of each pitch entirely; all C pitches 
would be considered identical and all C note events would be grouped, altogether 
producing 12 sequences. Doing so potentially creates the same interference effects 
between parts that we try to avoid, however. 

On the positive side, ignoring the octave helps cope with harmonic inversions. 
For example, the pitches C, E, and G played together are called a C major chord, 
regardless of their relative order. That is, C3-E3-G3 is a C major chord, as is 
E3-G3-C4, though the latter is said to be “in inversion.” All such inversions have 
the same subjective harmonic quality as the original (“root position”) chord. 

If one made some or all the G3 notes in a piece into G4 notes (moving them all 
up by one octave), the resulting piece would be subjectively similar, yet these two 
pieces would be deemed different by the computation described above. Ignoring 
octaves clearly resolves this problem. 

These examples demonstrate the difficulty involved in both adequately defin- 
ing and calculating musical similarity at a syntactic level. In our experimental 
setup, we choose not to account for transpositions or harmonic inversions. Our 
domain of consideration is classical MIDI files on the web, and such variations 
are generally uncommon; that is, one does not find renditions of Beethoven’s 
“Moonlight Sonata” transposed to any key but its original C# minor. However, 
one could imagine other domains where such considerations would be important. 

3.3 Timing 

In text, the word is the natural base for computation. In music, the natural base 
is the note. In particular, we have chosen the relative timings of the notes as the 
natural structure, corresponding to the order of words for text documents. 

We have found that using only the start time, instead of start time and 
duration, to represent a note is the most effective. The duration, or length of 
time a note lingers, can vary somewhat according to the style of the musician, 
as well as other factors; hence it is less valuable information and we ignore it. 
Even focusing only on start times, various preprocessing steps can make the 
computation more effective. 

MIDI files represent time in “ticks,” a quantity whose duration is typically 
1/120th of a quarter-note, but may be redefined by MIDI events. We scale all 
time values in a MIDI file as a preprocessing step so that one tick is 1/120th of 
a quarter-note. Given this and the fact that we ignore MIDI Tempo events, the 
same music with a different tempo should appear similar using our metric. 

All times are quantized to the nearest multiple of 60 ticks, the duration of an 
eighth-note (typically between 150 and 400 milliseconds) in order to filter out 
small, unimportant variations in timing. 

Recall that note start times are given relative to previous notes. Musically 
speaking, a time difference corresponding to four measures is in a sense the 
same as forty measures, in that they are both a long period of time, probably 
separating what would be considered two distinct blocks of notes (“phrases”). 
We therefore cap time differences at 1,920 ticks, which corresponds to 4 measures 
of 4/4 time (about 8 seconds), and do not record any shingle containing a time 
difference larger than this. Another option is to cap start times at 1,920 ticks 
but not discard any shingles. 
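
The timing preprocessing can be sketched as follows. The function names and the 
ticks_per_quarter parameter are hypothetical. We also assume that quantization is 
applied to absolute start times before differences are taken; the text does not 
say which order is used. 

```python
def normalize_start_times(start_ticks, ticks_per_quarter, grid=60):
    """Rescale absolute start times to 120 ticks per quarter-note and quantize."""
    quantized = []
    for tick in start_ticks:
        scaled = tick * 120 / ticks_per_quarter
        # Round to the nearest multiple of grid (an eighth-note by default).
        quantized.append(int(scaled / grid + 0.5) * grid)
    return quantized


def time_differences(start_ticks):
    """Start times relative to the previous note."""
    return [later - earlier for earlier, later in zip(start_ticks, start_ticks[1:])]


def keep_shingle(differences, cap=1920):
    """A shingle is recorded only if no time difference exceeds the cap."""
    return max(differences) <= cap
```

For a file with 480 ticks per quarter-note, the start times 0, 250, 470 and 960 are 
rescaled to 0, 62.5, 117.5 and 240 and then quantized to 0, 60, 120 and 240. 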

3.4 Extraneous Information 

MIDI files may contain a great deal of information besides the note-on and note- 
off events, all of which we consider irrelevant to the resemblance computation. 
This is analogous to ignoring capitalization and punctuation in the text domain; 
note that we could include this additional information, but we choose to ignore 
it for overall performance. 

For example, we ignore information about the instruments used and the 
author and title of the piece, and base the computation solely on the music. In 
practice such information could be used for indexing or classifying music, see e.g. 

We note here that one could certainly use this information in conjunction 
with our techniques, although it implies that one trusts the agents generating 
this information. 

Similarly we ignore track and channel numbers. MIDI files may be subdi- 
vided into multiple tracks, each of which typically contains the events describing 
one musical part (one instrument, possibly) in a complex musical piece. Track 
divisions are ignored because they have no intrinsic musical meaning and are 
not employed consistently. Notes from all tracks in a multi-track MIDI file are 
merged into a single track as a preprocessing step. Also, MIDI events are associ- 
ated with one of 16 channels to allow for directed communication on a network; 
this channel number also has no musical meaning and is ignored. 

Musically, we ignore differences in tempo, volume, and note velocity, focusing 
on the timing aspects as described above. 



3.5 The Algorithm 

Fingerprinting. For each pitch z, the note events in pitch z in a MIDI file 
are extracted and viewed as a sequence of numbers (start times). Times are 
normalized, and recall that start times are recorded as the time since the last 
event. Each subsequence of four start times is viewed as a shingle; any shingle 
containing a start time larger than 1,920 ticks is discarded. Each shingle is hashed 
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using Rabin fingerprints 0 into a 16-bit number; those shingles whose hash is 
0 modulo 19 are kept in the sketch of pitch z for the file. 

Note that increasing the number of start times per shingle decreases the 
likelihood of false matches but magnifies the effect of small variations on the re- 
semblance score. We have found experimentally that four start times per shingle 
is a reasonable compromise, although the choice is somewhat arbitrary. Three 
start times per shingle gives poorer results, but five and even six start times per 
shingle yield performance quite similar to that of four. 

The modulus 19 may be varied to taste; smaller values can increase accuracy 
of resemblance estimates at the cost of additional storage, and vice versa. 

A 16-bit hash value economizes storage yet yields a somewhat small hash 
space. Since an average MIDI file produces only about 1,200 shingles to be 
hashed, the probability of collision is small enough to justify the storage 
savings. Intermediate sizes such as 20-bit or 24-bit hash values introduce 
substantial programming complications that slow computation, and a full 32-bit 
hash value would substantially increase memory requirements; 16 bits therefore 
appears to be the best practical choice. 
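
The fingerprinting step can be sketched as follows. Several details are 
assumptions rather than a description of our implementation: the low 16 bits of 
zlib.crc32 stand in for a 16-bit Rabin fingerprint, time differences are taken 
between consecutive notes of the same pitch, the input times are already 
normalized and quantized, and the function name pitch_sketches is hypothetical. 

```python
import zlib
from collections import defaultdict


def pitch_sketches(notes, w=4, cap=1920, modulus=19):
    """Build one sketch per pitch from (pitch, start_tick) pairs."""
    starts_by_pitch = defaultdict(set)
    for pitch, start in notes:
        # Simultaneous notes of the same pitch are exact duplicates and collapse.
        starts_by_pitch[pitch].add(start)

    sketches = {}
    for pitch, starts in starts_by_pitch.items():
        ordered = sorted(starts)
        gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
        kept = set()
        for i in range(len(gaps) - w + 1):
            shingle = gaps[i:i + w]
            if max(shingle) > cap:
                continue                      # discard shingles with a long gap
            text = ",".join(str(gap) for gap in shingle)
            fingerprint = zlib.crc32(text.encode("ascii")) & 0xFFFF
            if fingerprint % modulus == 0:
                kept.add(fingerprint)
        sketches[pitch] = kept
    return sketches
```

Setting modulus to 1 keeps every fingerprint, which is convenient for checking the 
shingling itself on small inputs. 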

Resemblance and Containment. Let A_z be the sketch (set of shingles) for 
pitch z in a file A. Given sketches for two files A and B, for each z we compute 
the resemblance 

\[ r_z = \frac{|A_z \cap B_z|}{|A_z \cup B_z|} \]

just as in the text resemblance computation. A weighted average of these 128 
values gives the resemblance between A and B, where r_z’s weight is (|A_z| + |B_z|). 
Other scaling factors are possible; this scaling factor is intuitively appealing 
in that it weights the per-pitch resemblance by the total number of shingles. In 
particular, this approach gives more weight to pitches with many matching notes, 
which is useful in cases where one file may only contain certain pitches (such as 
only the melody). We have found this performs well in practice, and in particular 
it performs better than using an unweighted average. 

Containment is defined analogously; the containment score for pitch z is 
instead weighted by |A_z|. 
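
The weighted averages can be sketched as follows, with each file represented as a 
dictionary from pitch to sketch. The function names are hypothetical, and 
returning 0.0 when both files have empty sketches is our assumption. Note that 
weighting the per-pitch containment by |A_z| reduces the weighted average to the 
total number of shared shingles divided by the total number of shingles in A. 

```python
def weighted_resemblance(a, b):
    """Average of per-pitch resemblances, each weighted by |A_z| + |B_z|."""
    numerator = denominator = 0.0
    for z in set(a) | set(b):
        az, bz = a.get(z, set()), b.get(z, set())
        weight = len(az) + len(bz)
        if weight == 0:
            continue                          # pitch unused in both files
        numerator += weight * len(az & bz) / len(az | bz)
        denominator += weight
    return numerator / denominator if denominator else 0.0


def weighted_containment(a, b):
    """Average of per-pitch containments of A in B, each weighted by |A_z|."""
    shared = sum(len(az & b.get(z, set())) for z, az in a.items())
    total = sum(len(az) for az in a.values())
    return shared / total if total else 0.0
```

For example, with A = {60: {1, 2, 3, 4}, 62: {5, 6}} and 
B = {60: {1, 2, 3, 4}, 62: {5, 7}, 64: {8}}, the per-pitch resemblances are 1, 1/3 
and 0 with weights 8, 4 and 1, giving a resemblance of 28/39; the containment of A 
in B is 5/6. 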

4 Experimental Results 

4.1 Test Data 

To our knowledge there is no large, publicly available standard corpus of MIDI 
files. Instead we developed our own corpus using five sites offering MIDI files on 
the Internet. We downloaded 14,137 MIDI files (11,198 of which were unique) for 
testing purposes. Many sites limit downloads to 100 per user per day, so custom 
robot programs were developed to obtain the MIDI files while respecting usage 
restrictions. 
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Fig. 2. Percent of note events altered vs. estimated resemblance 



4.2 Behavior of the Resemblance Score 

A musical resemblance computation is only useful insofar as it corresponds well 
to some intuitive notion of resemblance. That is, an estimated resemblance 
should allow reliable conclusions about the relationship between two MIDI files. 
Our proposed computation corresponds to the notion that two files resemble 
each other when they have a high proportion of note events in common. Other 
notions of resemblance are possible, but this is probably the most natural and 
is also the notion captured by the text resemblance computation. 

To test the accuracy of this computation, we took 25 MIDI files from the web
and generated 100 altered MIDI files from each of them. Each note event was
changed with probability q/100, so roughly a percentage q of the note events
were slightly altered at random. If a note was changed, then:

— with probability 1/4 its pitch was changed, either up or down by one pitch, 
at random. 

— with probability 1/4 it was deleted. 

— with probability 1/2 it was given a new start time, chosen uniformly at 
random between the original start time and the time of either the last event 
or an implicit event 120 ticks before, whichever was more recent. 

Although these changes do not correspond to a specific model of how differences 
in MIDI files might arise, we feel it provides a reasonable test of the system’s 
performance. 
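The alteration procedure can be sketched as follows. This is a minimal reading of the three rules above, assuming that a file is a time-sorted list of `(start_tick, pitch)` pairs, that "the last event" means the preceding note event, and that pitches are clamped to the MIDI range 0 to 127; the function name `perturb` is hypothetical.

```python
import random


def perturb(notes, q, rng=None):
    """Alter each (start_tick, pitch) note with probability q/100."""
    rng = rng or random.Random()
    altered = []
    prev_time = None
    for start, pitch in notes:
        # Earliest allowed new start: the previous event or an implicit
        # event 120 ticks earlier, whichever is more recent.
        earliest = max(0, start - 120)
        if prev_time is not None:
            earliest = max(earliest, prev_time)
        prev_time = start
        if rng.random() >= q / 100:
            altered.append((start, pitch))      # note left unchanged
            continue
        u = rng.random()
        if u < 0.25:                            # pitch up or down by one
            new_pitch = min(127, max(0, pitch + rng.choice((-1, 1))))
            altered.append((start, new_pitch))
        elif u < 0.5:                           # note deleted
            continue
        else:                                   # new start time
            altered.append((rng.randint(earliest, start), pitch))
    return altered
```

Passing a seeded `random.Random` instance makes a run repeatable, which is convenient when regenerating a test set.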

The resemblance of each of the 2,500 instances was computed with the
original. The distribution of the resulting resemblance scores versus q is shown in
Figure 2. There is a reasonable and fairly reliable relationship between q and the
resulting resemblance score.

To view the results in a different way, we also consider a small test set
consisting of five familiar pieces. The pieces are Beethoven’s “Moonlight Sonata,”
First Movement; Saint-Saëns’s “Aquarium” from “Carnival of the Animals;”
Rimsky-Korsakov’s “Flight of the Bumblebee;” Wagner’s “Bridal Chorus from
Lohengrin;” and Schumann’s “Träumerei.” Variations on each of these five files
were constructed, with 3%, 6%, and 9% of all notes altered. Also, files consisting
of only the treble parts of the songs were created. As Table 1 shows, there can
be significant variance in the results depending on the piece.



Table 1. Sample resemblance computations

          Moonlight  Aquarium  Flight  Lohengrin  Träumerei
Treble    0.9375     0.4968    0.7739  0.8000     0.6667
3%        0.9179     0.8850    0.8657  1.0000     1.0000
6%        0.7050     0.7529    0.7087  0.8793     0.8571
9%        0.6463     0.7412    0.5474  0.7829     0.4583



4.3 Simple Clustering 

This resemblance computation’s most compelling application may be determin- 
ing when two MIDI files represent the same piece of music. We clustered our 
corpus of documents in the following way: any pair of files A and B for which
r(A, B) exceeded some threshold t were put into the same cluster.
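The clustering rule amounts to taking connected components of the graph whose edges are the pairs with resemblance above t. A minimal sketch using a union-find structure is given below; the function name `cluster` and the callable `r` are placeholders for this example, and the quadratic loop over all pairs is a simplification rather than a description of how the corpus was actually processed.

```python
from itertools import combinations


def cluster(files, r, t):
    """Group files so that A and B share a cluster whenever r(A, B) > t."""
    parent = {f: f for f in files}

    def find(f):
        while parent[f] != f:
            parent[f] = parent[parent[f]]   # path halving
            f = parent[f]
        return f

    for a, b in combinations(files, 2):
        if r(a, b) > t:
            parent[find(a)] = find(b)       # merge the two clusters

    clusters = {}
    for f in files:
        clusters.setdefault(find(f), []).append(f)
    return list(clusters.values())
```

Note that membership is transitive: two files with a low pairwise score still end up together if a chain of above-threshold pairs connects them.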

We find that high thresholds (t > 0.45) all but eliminate false matches; the
contents of nearly every cluster correspond to the same piece of music, though
all renditions of one piece may fail to cluster together. It is very interesting
that variations of the same piece of music can have a fairly low resemblance
score. (See the discussion for Table 2.) This appears to be a major difference
between the musical setting and the text setting; there appears to be a wide
variation in what constitutes the same musical piece, while for text the syntactic
differences among documents considered similar are much smaller. It may also
reflect a partial problem with the pitch-by-pitch approach: a delay in one note can
affect the relative timing of multiple pitches, so changes can have an effect on a
larger number of shingles.

At low thresholds (t < 0.15), with high accuracy all renditions of a piece
of music cluster together, but different musical pieces often cluster together
also. For instance, many (distinct) Bach fugues tend to end up in the same cluster
because of their strong structural and harmonic similarities. For such values of t
we find that a few undesirably large clusters of several hundred files form; many
small clusters aggregate because of a few fluke resemblances above the threshold,
and snowball into a meaningless crowd of files.

To gain further insight into the importance of the threshold, we did pairwise
resemblance comparisons for all the files in our corpus. In Figure 3 we show
the number of pairs with a given resemblance; this is a histogram with the
resemblance scores rounded to the nearest thousandth. (Notice the y-axis is on
a logarithmic scale.) The graph naturally shows that most documents will have
low resemblance; even moderate resemblance scores may be significant.

Fig. 3. Relative frequency of resemblance scores from our corpus.




We chose a compromise value of t = 0.35, for which we find that clusters tend
to correspond well to distinct pieces of music with very few false matches; we
present some of the results of a clustering with this threshold. Table 2 shows the
site’s descriptive information for the contents of some representative clusters.
Identical renditions (by the same sequencer), as well as renditions from different
sequencers, cluster together consistently.

Such a clustering might help the Classical MIDI Connection (CMC) learn 
that their “Prelude No. IV in C# min” is from The Well-Tempered Clavier, or 
point out that the attribution of “Schubert-Liszt Serenade” to Liszt is possibly 
incorrect. This might also help a performer or MIDI site determine who else has 
posted (perhaps illegitimately) their own MIDI files on the web. As an example 
of the importance of the choice of threshold, consider the cluster corresponding 
to the variations of “The Four Seasons.” The pairwise resemblances of its
members are shown in Table 3. The two variations by the sequencer Dikmen appear
identical, so we consider the first four from the same source but from different
sequencers. As can be seen, the same piece of music can yield very different
shingles from different sequencers.

Naturally, more sophisticated clustering techniques could improve perfor- 
mance. For example, we have also tried incrementally building clusters, putting 
clusters for A and B together only when the average resemblance of A with all 
members of B’s cluster, and vice versa, exceeds some threshold. We find that this 
eliminates the problematic large clusters described above, but otherwise yields 
nearly identical clusters, and that a lower threshold of t = 0.2 performs well. 

4.4 Performance 

Source code, written in C, was compiled and run on a Compaq AlphaServer 
DS20 with a 500 MHz CPU, four gigabytes of RAM, a 64 KB on chip cache and 
a 4 MB on board cache. 

Our implementation can fingerprint MIDI data at a rate of approximately 7.8 
megabytes per second (of user time). If a typical MIDI file is around 45 kilobytes
in size, then this amounts to producing sketches for about 174 MIDI files per
second. A typical MIDI file’s sketch requires about 128 bytes of storage, not 
counting bookkeeping information such as the file’s URL. Approximately 3,096 
resemblances can be computed per second; this may be sped up by faster I/O, 
by sampling fewer shingles (that is, increasing p), or by searching in parallel. 
We expect that performance could also be improved by rewriting our prototype 
code. 



Table 2. Contents of some representative clusters

File Description                                        Sequencer       Source

Prelude No. 4 from The Well-Tempered Clavier,           (unknown)       CMC
  Book I (Bach)
Prelude No. IV in C# min (Bach)                         (unknown)       CMC
Prelude No. 4 in C# min from The Well-Tempered          M. Reyto        prs.net
  Clavier, Book I (Bach)

Variations on a Theme by Haydn (Brahms)                 J. Kaufman      CMC
Variations for Orchestra on a Theme from Haydn’s        J. Kaufman      prs.net
  St. Anthony’s Chorale (Brahms)

The Four Seasons, No. 2 - ‘Summer’ in G-, Allegro       M. Dikmen       prs.net
  non molto (Vivaldi)
The Four Seasons, No. 2 - L’Estate (Summer) in G-,      M. Reyto        prs.net
  1. Allegro non molto (Vivaldi)
The Four Seasons, No. 2 - ‘Summer’ in G-, 2. Estate     A. Zammarrelli  prs.net
  (Vivaldi)
The Four Seasons, No. 2 - ‘Summer’ in G-, Allegro       N. Sheldon Sr.  prs.net
  non molto (Vivaldi)
Summer from the Four Seasons                            M. Dikmen       Classical MIDI

Symphony No. 3 in D (Op. 29), 2nd Mov’t                 S. Zurflieh     CMC
  (Tchaikovsky)
Symphony No. 3, 2nd Mov’t (Tchaikovsky)                 S. Zurflieh     prs.net
Symphony No. 3 Op. 29, Movt. 2 (Tchaikovsky)            S. Zurflieh     sciortino.net

Schwanengesang, 4. Serenade (Schubert)                  F. Raborn       prs.net
Schubert-Liszt Serenade (Liszt)                         (unknown)       Classical MIDI
Serenade (Schubert)                                     F. Raborn       Classical MIDI

Symphony No. 94 in G ‘Surprise,’ 4. Allegro molto       J. Urban        prs.net
  (Haydn)
Symphony No. 94 in G ‘Surprise,’ 4. Finale:             L. Jones        prs.net
  Allegro molto (Haydn)

Source          URL
CMC             http://www.midiworld.com/cmc/
prs.net         http://www.prs.net/midi.html
Classical MIDI  http://www.classical.btinternet.co.uk/page7.htm
sciortino.net   http://www.sciortino.net/music/



Table 3. Similarity scores from a cluster.

             Dikmen  Reyto   Zammarrelli  Sheldon
Dikmen       -       0.4636  0.3233       0.3680
Reyto        0.4636  -       0.3612       0.5901
Zammarrelli  0.3233  0.3612  -            0.2932
Sheldon      0.3680  0.5901  0.2932       -


5 Conclusions 

We believe that this musical resemblance computation represents an effective 
adaptation of established text resemblance techniques to the domain of MIDI 
files. The pitch by pitch fingerprinting approach provides a useful and sound 
framework for this adaptation, and can be easily extended to tackle more
complex musical issues like transpositions and inversions. Our experiments suggest
that this computation can be used to discover near-duplicate MIDI files with a high
degree of accuracy. Further engineering and tuning work would be useful to
optimize this approach.

We believe this approach may be useful in conjunction with other techniques 
for organizing and searching musical databases. An important open question is 
how these techniques can be applied to other musical formats, such as MP3.



Abstract. In [3] we introduced an adaptive algorithm for computing the
intersection of k sorted sets within a factor of at most 8k comparisons
of the information-theoretic lower bound under a model that deals with 
an encoding of the shortest proof of the answer. This adaptive algorithm 
performs better for “burstier” inputs than a straightforward worst-case 
optimal method. Indeed, we have shown that, subject to a reasonable 
measure of instance difficulty, the algorithm adapts optimally up to a 
constant factor. This paper explores how this algorithm behaves un- 
der actual data distributions, compared with standard algorithms. We 
present experiments for searching 114 megabytes of text from the World 
Wide Web using 5,000 actual user queries from a commercial search en- 
gine. From the experiments, it is observed that the theoretically optimal 
adaptive algorithm is not always the optimal in practice, given the distri- 
bution of WWW text data. We then proceed to study several improve- 
ment techniques for the standard algorithms. These techniques combine 
improvements suggested by the observed distribution of the data as well 
as the theoretical results from [3]. We perform controlled experiments 
on these techniques to determine which ones result in improved perfor- 
mance, resulting in an algorithm that outperforms existing algorithms 
in most cases. 



1 Introduction 

In SODA 2000 [3] we proposed an adaptive algorithm for computing the
intersection of sorted sets. This problem arises in many contexts, including data
warehousing and text-retrieval databases. Here we focus on the latter
application, specifically web search engines. In this case, for each keyword in the
query, we are given the set of references to documents in which it occurs, obtained
quickly by an appropriate data structure. Our goal is to identify those
documents containing all the query keywords. Typically these keyword sets are
stored in some natural order, such as document date, crawl date, or by URL
identifier. In practice, the sets are large. For example, as of July 2000, the
average word from user query logs matches approximately nine million documents
on the Google web search engine. Of course, one would hope that the answer
to the query is small, particularly if the query is an intersection. It may also be 
expected that in dealing with grouped documents such as news articles or web 
sites, one will find a large number of references to one term over a few relatively 
short intervals of documents, and little outside these intervals. We refer to this 
data nonuniformity as “burstiness.” 

An extreme example that makes this notion more precise arises in computing 
the intersection of two sorted sets of size n. In this case, it is necessary to verify 
the total order of the elements via comparisons. More precisely, for an algorithm 
to be convinced it has the right answer, it should be able to prove that it has 
the right answer by demonstrating the results of certain comparisons. At one 
extreme, if the sets interleave perfectly, $\Omega(n)$ comparisons are required to prove
that the intersection is empty. At the other extreme, if all the elements in one 
set precede all the elements in the other set, a single comparison suffices to 
prove that the intersection is empty. In between these extremes, the number of 
comparisons required is the number of “groups” of contiguous elements from a 
common set in the total order of elements. The fewer the groups the burstier the 
data. 
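For two disjoint sorted sets this notion of burstiness is easy to compute. The sketch below counts the boundaries between consecutive groups in the merged order, which is one less than the number of groups and agrees with the single comparison needed in the fully separated extreme. The function name `boundary_count` is hypothetical, and the sketch deliberately ignores the case of common elements.

```python
def boundary_count(a, b):
    """Count adjacent pairs in the merged order of two disjoint sorted
    lists whose members come from different lists."""
    i = j = 0
    boundaries = 0
    last = None                     # which list supplied the previous element
    while i < len(a) or j < len(b):
        take_a = j == len(b) or (i < len(a) and a[i] < b[j])
        if take_a:
            i += 1
        else:
            j += 1
        source = 0 if take_a else 1
        if last is not None and source != last:
            boundaries += 1         # a new group starts here
        last = source
    return boundaries
```

Perfectly interleaved inputs give a count that grows linearly with the set size, while inputs in which one set entirely precedes the other give a count of one.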

This example leads to the idea of an adaptive algorithm. Such an
algorithm makes no a priori assumptions about the input, but determines the 
kind of instance it faces as the computation proceeds. The running time should 
be reasonable for the particular instance — not the overall worst-case. 

In the case of two sets, it is possible to obtain a running time that is roughly 
a logarithmic factor more than the minimum number of comparisons needed 
for a proof for that instance. This logarithmic factor is necessary on average. 
Intersection of several sorted sets becomes more interesting because then it is no 
longer necessary to verify the total order of the elements. Nonetheless, in [3], we
demonstrate a simple algorithmic characterization of the proof with the fewest
comparisons. Another difference with k > 2 sets is that there is no longer an
adaptive algorithm that matches the minimum proof length within a roughly
logarithmic factor; it can be necessary to spend roughly an additional factor of
k in comparisons [3]. Although we have been imprecise here, the exact lower
bound can be matched by a fairly simple adaptive algorithm described in [3].
The method proposed, while phrased in terms of a pure comparison model, is 
immediately applicable to any balanced tree (e.g., B-tree) model. 

This means that while in theory the advantages of the adaptive algorithm are
undeniable (it is no worse than the worst-case optimal method, and it does as well
as theoretically possible), in practice the improvement depends on the burstiness
of the actual data. The purpose of this paper is to evaluate this improvement, 
which leads to the following questions: what is a reasonable model of data, and 
how bursty is that data? 

Our results are experiments on “realistic” data, a 114-megabyte crawl from
the web and 5,000 actual user queries made on the Google™ search engine; see
Section 2 for details.



Experiments on Adaptive Set Intersections for Text Retrieval Systems 






What do we measure of this data? We begin by comparing two algorithms 
for set intersection: the optimal adaptive algorithm from [3], and a standard
algorithm used in some search engines that has a limited amount of adaptive- 
ness already, making it a tough competitor. We refer to the former algorithm 
as Adaptive, and to the latter algorithm as SvS, small versus small, because it 
repeatedly intersects the two smallest sets. As a measure of burstiness, we also 
compute the fewest comparisons required just to prove that the answer is correct. 
This value can be viewed as the number of comparisons made by an omniscient 
algorithm that knows precisely where to make comparisons, and hence we call it 
Ideal. It is important to keep in mind that this lower bound is not even
achievable: there are two factors unaccounted for, one required and roughly
logarithmic, and the other roughly k in the worst case, not to mention any constant
factors implicit in the algorithms. (Indeed we have proved stronger lower bounds
in [3].) We also implement a metric called IdealLog that approximately incorporates
the necessary logarithmic factor. See Section 3 for descriptions of these algorithms.

In all cases, we measure the number of comparisons used by the algorithms. 
Of course, this cost metric does not always accurately predict running time, 
which is of the most practical interest, because of caching effects and data- 
structuring overhead. However, the data structuring is fairly simple in both 
algorithms, and the memory access patterns have similar regularities, so we
believe that our results are indicative of running time as well. There are many
positive consequences of comparison counts in terms of reproducibility,
specifically machine-independence and independence from much algorithm tuning.
Comparison counts are also inherently interesting because they can be directly
compared with the theoretical results in [3].

Our results regarding these algorithms (see Section 4) are somewhat surprising
in that the standard algorithm outperforms the optimal adaptive algorithm
in many instances, albeit the minority of instances. This phenomenon seems to
be caused by the overhead of the adaptive algorithm repeatedly cycling through 
the sets to exploit any obtainable shortcuts. Such constant awareness of all sets 
is necessary to guarantee how well the algorithm adapts. Unfortunately, it seems 
that for this data set the overhead is too great to improve performance on
average, for queries with several sets. Thus in Section 5 we explore various
compromises between the two algorithms, to evaluate which adaptive techniques have
a globally positive effect. We end up with a partially adaptive algorithm that 
outperforms both the adaptive and standard algorithms in most cases. 

2 The Data 

Because our exploration of the set-intersection problem was motivated by text
retrieval systems in general and web search engines in particular, we tested the
algorithm on a 114-megabyte subset of the World Wide Web using a query log
from Google. The subset consists of 11,788,110 words¹ with 515,277 different
words, for an average of 22.8 occurrences per word. Note that this average is in
sharp contrast to the average number of documents containing a query word,
because a small number of very common words are used very often.

¹ The text is tokenized into “words” consisting of alphanumerical characters; all other
characters are considered whitespace.

Table 1. Query distribution in Google log.

# keywords   # queries in log
2            1,481
3            1,013
4            341
5            103
6            57
7            26
8            14
9            4
10           2
11           1

We indexed this corpus using an inverted word index, which lists the docu- 
ment(s) in which each term occurs. The plain-text word index is 48 megabytes. 
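A minimal sketch of such an index follows. It assumes ASCII alphanumeric tokens, lowercases the text (case handling is not specified here, so this is a guess), and uses list positions as document identifiers; `tokenize` and `build_index` are hypothetical names. Because documents are visited in increasing order, each posting list comes out sorted, which is the form the intersection algorithms expect.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into alphanumeric words; every other character separates."""
    return re.findall(r"[A-Za-z0-9]+", text.lower())


def build_index(docs):
    """Map each word to the sorted list of ids of documents containing it."""
    postings = defaultdict(list)
    for doc_id, text in enumerate(docs):
        for word in set(tokenize(text)):    # one entry per document
            postings[word].append(doc_id)
    return dict(postings)
```

A multi-keyword query is then answered by intersecting the posting lists of its terms.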

The query log is a list of 5,000 queries as recorded by the Google search 
engine. Queries consisting of just a single keyword were eliminated because they 
require no intersections. This reduces the query set to 3,561 entries. Of those, 
703 queries resulted in trivially empty sets because one or more of the query 
terms did not occur in the index at all. The remaining 2,858 queries were used 
to test the intersection algorithms. Table 1 shows the distribution of the number
of keywords per query. Note that the average number of keyword terms per query
is 2.286, which is in line with data reported elsewhere for queries to web search
engines. Notice that beyond around seven query terms the query set is not
large enough to be representative.

The data we use is realistic in the sense that it comes from a real crawl of the 
web and a real query log. It is idiosyncratic in the sense that it is a collection
of pages, grouped by topic, time and language. Other set intersections outside 
text retrieval, or even other text-retrieval intersections outside the web, might 
not share these characteristics. 

The query log has some anomalies. First, the Google search engine does not 
search for stop words in a query unless they are preceded by ‘+’. This may
cause knowledgeable users to refrain from using stop words, and thus produce 
an underrepresentation of the true frequency of stop words in a free-form search 
engine. Second, it seems that in the Google logs, all stop words have been re- 
placed by a canonical stop word ‘a’. Third, the lexically last query begins with 
‘sup’, so for example no queries start with ‘the’ (which is not a Google stop 
word).

Figure 1 shows how the different set sizes are represented in the query log.
At the top of the chart we see a large set corresponding to the word ‘a’, which
is very common both in the corpus and in the query log. In Figure 2 we see
the distribution of the total set sizes of the queries. In other words, given a
query from the query log, we sum the number of elements in each of the sets in
that query and plot this value. As is to be expected from the sum of a set of
random variables, the distribution roughly resembles a normal distribution, with
the exception of those queries involving the stop word ‘a’, as discussed above.

Fig. 1. Size of each set, counted repeatedly per query.

Fig. 2. Total number of elements involved in each query.

3 Main Algorithms 

We begin by studying three main methods in the comparison model for deter- 
mining the intersection of sorted sets. The first algorithm, which we refer to as 
SvS, repeatedly intersects the two smallest sets; to intersect a pair of sets, it 
binary searches in the larger set for each element in the smaller set. A more 
formal version of the algorithm is presented below. 

Algorithm SvS. 

— Sort the sets by size. 

— Initially let the candidate answer set be the smallest set. 

— For every other set S in increasing order by size: 

• Initially set $\ell$ to 1.

• For each element e in the candidate answer set:

o Perform a binary search for e in S, between $\ell$ and $|S|$.²
o If e is not found, remove it from the candidate answer set.
o Set $\ell$ to the value of low at the end of the binary search.
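The steps above can be sketched in Python as follows. This is an illustration, not the code used in the experiments: it uses 0-based indices, relies on the standard `bisect_left` for the binary search, and does not count comparisons. The name `svs_intersect` is hypothetical.

```python
from bisect import bisect_left


def svs_intersect(sets):
    """Intersect sorted lists by repeatedly filtering the smallest one."""
    if not sets:
        return []
    ordered = sorted(sets, key=len)         # sort the sets by size
    candidates = list(ordered[0])           # smallest set is the candidate answer
    for s in ordered[1:]:
        low = 0                             # left end of the remaining search range
        kept = []
        for e in candidates:
            low = bisect_left(s, e, low)    # binary search between low and |S|
            if low < len(s) and s[low] == e:
                kept.append(e)              # e survives this set
        candidates = kept
    return candidates
```

Carrying `low` forward is what restricts each search to the part of S not yet passed, as in the last step of the algorithm.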

This algorithm is widely used in practice, and even possesses a certain amount
of adaptivity. Because intersections only make sets smaller, as the algorithm
progresses with several sets, the time to do each intersection effectively reduces.
In particular, the algorithm benefits largely if the set sizes vary widely, and
performs poorly if the set sizes are all roughly the same. More precisely, the
algorithm SvS makes at least $\Omega(r \log n)$ and at most $O(n \log(n/k))$ comparisons,
where r is the size of the resulting intersection and n is the total number of
elements over all k sets.

² John Bentley (personal communication, September 2000) has pointed out that it is
frequently more efficient to binary search between 1 and $|S|$ all the time, because of
similar access patterns causing good cache behavior. However, since we are working
in the comparison model, searching from $\ell$ can only make SvS a stronger competitor.

The second algorithm, which we refer to as Adaptive, is the adaptive method
proposed by the authors [3]. It has two main adaptive features. The most
prominent is that the algorithm takes an element (the smallest element in a particular
set whose status in the intersection is unknown) and searches for it in each of 
the other sets “simultaneously,” and may update this candidate value in “mid 
search.” A second adaptive feature is the manner in which the algorithm per- 
forms this search. It uses the well-known approach of starting at the beginning 
of an array and doubling the index of the queried location until we overshoot. A 
binary search between the last two locations inspected completes the search for
a total time of $2 \lg i$ comparisons, where i denotes the final location inspected.
We refer to this approach as galloping. 
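A galloping search can be sketched as below, under the assumption that probes are made at distances 1, 2, 4, ... from the starting position and that the final binary search is delegated to the standard `bisect_left`. The name `gallop_search` and the optional `start` parameter are choices made for this example.

```python
from bisect import bisect_left


def gallop_search(arr, target, start=0):
    """Return the first index i >= start with arr[i] >= target, or len(arr)."""
    n = len(arr)
    lo = start          # indices in [start, lo) are known to hold values < target
    probe = start       # location queried next
    offset = 1          # distance of the next probe from start; doubles each time
    while probe < n and arr[probe] < target:
        lo = probe + 1
        probe = start + offset
        offset *= 2
    # Overshot (or ran off the end): finish with a binary search between
    # the last two locations inspected.
    return bisect_left(arr, target, lo, min(probe, n))
```

When the target lies near the start of the array, only a few probes are made, which is the case galloping is designed to exploit.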

A more precise description of the algorithm is the following: 

Algorithm Adaptive. 

— Initially set the eliminator to the first element of the first set. 

— Repeatedly cycle through the sets: 

• Perform one galloping step in the current set. 

• If we overshoot: 

o Binary search to identify the precise location of the eliminator.
o If present, increase occurrence counter and output if the count reaches k. 
o Otherwise, set the new eliminator to the first element in the current set 
that is larger than the current eliminator. If no such element exists, exit 
loop. 
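The following sketch illustrates the eliminator idea in Python. It is a simplification and not the algorithm exactly as stated: each visit to a set runs a complete galloping search instead of a single galloping step, so the searches in different sets are not interleaved. Input lists are assumed sorted with distinct values, and the names `gallop` and `adaptive_intersect` are hypothetical.

```python
from bisect import bisect_left


def gallop(arr, target, start):
    """First index i >= start with arr[i] >= target (len(arr) if none)."""
    n = len(arr)
    lo, probe, offset = start, start, 1
    while probe < n and arr[probe] < target:
        lo = probe + 1
        probe = start + offset          # probe at distance 1, 2, 4, ...
        offset *= 2
    return bisect_left(arr, target, lo, min(probe, n))


def adaptive_intersect(sets):
    """Intersect sorted lists by cycling through them with an eliminator."""
    k = len(sets)
    if k == 0 or any(len(s) == 0 for s in sets):
        return []
    if k == 1:
        return list(sets[0])
    pos = [0] * k                       # per-set search frontier
    eliminator = sets[0][0]
    count = 1                           # sets known to contain the eliminator
    result = []
    i = 1
    while True:
        s = sets[i]
        p = gallop(s, eliminator, pos[i])
        pos[i] = p
        if p == len(s):
            break                       # nothing >= eliminator is left in s
        if s[p] == eliminator:
            count += 1
            if count == k:              # present in every set: output it
                result.append(eliminator)
                if p + 1 == len(s):
                    break
                pos[i] = p + 1
                eliminator = s[p + 1]
                count = 1
        else:
            eliminator = s[p]           # first element larger than the old one
            count = 1
        i = (i + 1) % k                 # move on to the next set in the cycle
    return result
```

For example, `adaptive_intersect([[1, 3, 5, 7], [3, 4, 5, 8], [0, 3, 5, 9]])` yields `[3, 5]`.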

In [3], Adaptive is described as working from both ends of each set, but for
simplicity we do not employ this feature at all in this work. The worst-case
performance of Adaptive is within a factor of at most $O(k)$ of any intersection
algorithm, on average, and its best-case performance is within a roughly
logarithmic factor of the “offline ideal method.”

This last metric, which we refer to as Ideal, measures the minimum number
of comparisons required in a proof of the intersection computed. Recall that an
intersection proof is a sequence of comparisons that uniquely determines the
result of the intersection. For example, given the sorted sets {1,3} and {2},
the comparisons (1 < 2) and (2 < 3) form a proof of the emptiness of the
intersection. Of course, computing the absolute smallest number of comparisons
required takes significantly more comparisons than the value itself, but it can be
computed in linear time as proved in [3].

Because any algorithm produces a proof, the smallest descriptive complexity 
of a proof for a given instance is a lower bound on the time complexity of 
the intersection of that instance. Unfortunately this descriptive complexity or 
Kolmogorov complexity is not computable, so we cannot directly use this lower 
bound as a measure of instance difficulty or burstiness. Instead we employ two 
approximations to this lower bound.

First observe that the number of comparisons alone (as opposed to a binary
encoding of which comparisons) is a lower bound on the descriptive complexity 
of a proof. This is precisely the Ideal metric. The adaptive algorithm takes a 
roughly logarithmic factor more than Ideal, and it may take roughly a factor 
of k longer in the worst case. However, Ideal provides a baseline unachievable 
optimum, similar to that used in online competitive analysis. 

This baseline can be refined by computing the complexity of a description of 
this proof. Specifically, we describe a proof by encoding the compared elements, 
for each element writing the set and displacement from the previously compared 
element in that set. To encode this gap we need, on average, the log of the 
displacement value, so we term this the log-gap metric. For the purposes of this 
work we ignore the cost of encoding which sets are involved in each comparison. 
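As a rough illustration of the log-gap metric, the sketch below sums the base-2 logarithm of each displacement. The input format (for each set, the increasing 1-based positions of the elements that take part in the proof) and the name `log_gap_cost` are assumptions made for this example. Following the description literally, a displacement of 1 contributes nothing; an actual encoding would need at least a constant number of bits per element.

```python
from math import log2


def log_gap_cost(compared_positions):
    """Sum of log2(displacement) over all compared elements.

    compared_positions: one increasing list of 1-based positions per set.
    """
    total = 0.0
    for positions in compared_positions:
        previous = 0                        # displacement of the first element
        for p in positions:                 # is measured from the set's start
            total += log2(p - previous)
            previous = p
    return total
```

For instance, positions 1 and 5 in one set and position 8 in another cost log2(1) + log2(4) + log2(8) = 5.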

In our SODA paper [3] we show that a log-gap encoding is efficient, using
information-theoretic arguments. As we mentioned above, Ideal can be found in linear
time, yet the shortest proof even by the log-gap metric seems difficult to compute.
One can estimate this value, though, by computing the log-gap encoding of the 
proof with the fewest comparisons. This leads to a metric called IdealLog, Ideal 
with a log gap. We cannot claim that this is the shortest proof description but 
it seems a reasonable approximation. 



4 Main Experimental Results 

Figure 3 shows the number of comparisons required by Ideal to show that the
intersection is empty, and Figure 4 shows the size of the log-gap encoding of
this proof. The integral points on the x-axis correspond to the number of terms
per query, and the y-axis is the number of comparisons taken by either metric
on a logarithmic scale. Within each integral gap on the x-axis is a frequency
histogram. Each cross represents a query. The crosses are spaced horizontally
to be vertically separated by at least a constant amount, and they are scaled
to fill the horizontal space available. Thus, at a given vertical level, the width
of the chart approximates the relative frequency, and the density of the chart is
indicative of the number of queries with that cost.³ In addition, to the left of
each histogram is a bar (interval) centered at the mean (marked with an ‘X’) and
extending up and down by the standard deviation. This histogram/bar format
is used in most of our charts.

Figure 5 shows the number of comparisons used by Adaptive to compute the
intersection, with axes as in Figure 3. These values are normalized in Figure 7
by dividing by the IdealLog metric for each query. We observe Adaptive requires
on the average about 1 + 0.4k times as many comparisons as IdealLog. This
observation matches the worst-case ratio of around $O(k)$, suggesting that Adaptive
is wasting time cycling through the sets.

Fig. 5. Number of comparisons of Adaptive by terms in query.

Fig. 6. Number of comparisons of SvS by terms in query.

Fig. 7. Ratio of number of comparisons of Adaptive over IdealLog.
[Plot: “Adaptive [low wins] vs. IdealLog [high wins]”; x-axis: number of sets, 2 to 11.]

Fig. 8. Ratio of number of comparisons of SvS over IdealLog.
[Plot: “Two-Smallest Binary Search (SvS) [low wins] vs. IdealLog [high wins]”; x-axis: number of sets, 2 to 11.]



³ Unfortunately, there is a limit to the visual density, so that, for example, the leftmost
histograms in Figures 3 and 4 both appear black, even though the histogram in
Figure 3 is packed more tightly because of many points with value 2 (coming from
queries with 1-element sets; see Figure 1).

Figures 6 and 8 show the same charts for SvS, as absolute numbers of comparisons
and as ratios to IdealLog. They show that SvS also requires a substantially
larger number of comparisons than IdealLog, and for few sets (2 or 3) often more
comparisons than Adaptive, but the dependence on k is effectively removed. In
fact, SvS appears to improve slightly as the number of sets increases, presumably
because more sets allow SvS’s form of adaptivity (removing candidate elements
using small sets) to become more prominent.

Figure 9 shows the ratio of the running times of Adaptive and SvS, computed
individually for each query. Figure 10 shows the difference in another way,
subtracting the two running times and normalizing by dividing by IdealLog.
Either way, we see directly that Adaptive frequently performs better than SvS
only for a small number of sets (2 or 3), presumably because of Adaptive’s
overhead in cycling through the sets. SvS gains a significant advantage because
the intersection of the smallest two sets is very small and indeed often empty;
SvS therefore often terminates after the first pass, having examined only two
sets, while Adaptive constantly examines all k sets.
Fig. 9. Ratio of number of comparisons of Adaptive over SvS.

Fig. 10. Difference in number of comparisons of Adaptive and SvS, normalized by
IdealLog.



5 Further Experiments 

In this section we explore various compromises between the adaptive algorithm 
and SvS to develop a new algorithm better than both for any number of sets. 
More precisely, we decompose the differences between the two algorithms into 
main techniques. To measure the relative effectiveness of each of these techniques 
we examine most (though not all) of the possible combinations of the techniques. 

The first issue is how to search for an element in a set. Binary search is optimal
when trying to locate a random element. However, in the case of computing
an intersection using SvS (say), on average the element being located is
likelier to be near the front of the array. Therefore starting the search from
the front, as galloping does, is a natural improvement. Figure 11 confirms that
galloping in the second-smallest set (“half galloping”) is usually better than
binary searching (SvS). Variations in galloping may also result in improvements;
one simple example is increasing the galloping factor from 2 to 4. This particular
change has no substantial effect, positive or negative, in the case of half galloping;
see Figure 12. Another natural candidate is the Hwang-Lin merging algorithm
for optimal intersection of two random sets [?]. Again, comparing to the
half-galloping method, there is no clear advantage either way; see Figure 13.
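
To make the search strategy concrete, the following Python sketch intersects two
sorted, duplicate-free lists by galloping in the larger one. It is an
illustration only, not the code used in the experiments: the names gallop and
intersect_half_gallop, and the exact probe sequence (offsets from the current
position that double at each step, followed by a binary search in the bracketed
range), are assumptions made for the example.

```python
from bisect import bisect_left


def gallop(seq, lo, target):
    """Return the first index i >= lo with seq[i] >= target (len(seq) if none).

    Probes start at position lo and move forward in doubling steps, so an
    element near the front is located with few comparisons.
    """
    step = 1
    hi = lo
    while hi < len(seq) and seq[hi] < target:
        lo = hi + 1          # everything up to hi is known to be too small
        hi += step
        step *= 2            # galloping factor 2 (assumed)
    # The answer lies in seq[lo:hi+1]; finish with a binary search.
    return bisect_left(seq, target, lo, min(hi, len(seq)))


def intersect_half_gallop(small, large):
    """Scan the smaller list and gallop for each element in the larger one."""
    out = []
    pos = 0
    for x in small:
        pos = gallop(large, pos, x)
        if pos == len(large):
            break            # the larger list is exhausted
        if large[pos] == x:
            out.append(x)
    return out
```

Because each search resumes from the position reached by the previous one, the
cost of locating an element depends on how far ahead it lies rather than on the
size of the larger set.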



Fig. 11. Ratio of number of comparisons of SvS over SvS with galloping
(Two-Smallest Half-Gallop).

Fig. 12. Ratio of number of comparisons of Two-Smallest Half-Gallop (factor 2)
over Two-Smallest Half-Gallop Accelerated (factor 4).

Fig. 13. Ratio of number of comparisons of Two-Smallest Half-Gallop over
Two-Smallest Hwang-Lin.

Fig. 14. Ratio of number of comparisons of Two-Smallest Half-Gallop over
Two-Smallest Adaptive.



A second issue is that SvS sequentially scans the smallest set, and for each
element searches for a matching element in the second-smallest set. Alternatively,
one could alternate the set from which the candidate element comes. In one
step we search in the second-smallest set for the first element in the smallest set;
in the next step we search in the smallest set for the first uneliminated element
in the second-smallest set; etc. This is equivalent to repeatedly applying the
adaptive algorithm to the smallest pair of sets, and hence we call the algorithm
Two-Smallest Adaptive. The results in Figure 14 show that this galloping in
both sets rarely performs differently from galloping in just one set, but when
there is a difference it is usually an improvement.
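
The alternation between the two sets can be sketched as follows in Python. This
is a simplified illustration under stated assumptions: the inputs are sorted,
duplicate-free lists, the name intersect_alternating is hypothetical, and the
standard-library bisect_left stands in for the galloping search so that the
example stays short.

```python
from bisect import bisect_left


def intersect_alternating(a, b):
    """Intersect two sorted lists, alternating which one supplies the candidate."""
    seqs = [a, b]
    pos = [0, 0]             # first uneliminated position in each list
    out = []
    src = 0                  # index of the list supplying the next candidate
    while pos[0] < len(a) and pos[1] < len(b):
        dst = 1 - src
        x = seqs[src][pos[src]]
        # Search the other list; everything smaller than x is eliminated.
        pos[dst] = bisect_left(seqs[dst], x, pos[dst])
        pos[src] += 1
        if pos[dst] < len(seqs[dst]) and seqs[dst][pos[dst]] == x:
            out.append(x)
            pos[dst] += 1
        src = dst            # the next candidate comes from the other list
    return out
```

The only change relative to a one-sided scan is the final assignment, which
switches the roles of the two lists after every search.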

A third issue is that Adaptive performs galloping steps cyclically on all sets. 
This global awareness is necessary to guarantee that the number of comparisons 
is within a factor of optimal, but can be inefficient, especially considering that 
in our data the pairwise intersection of the smallest two sets is often empty. On 
the other hand, SvS blindly computes the intersection of these two sets with 
no lookahead. One way to blend the approaches of Adaptive and SvS is what 
we call Small Adaptive. We apply a galloping binary search (or the first step of 
Hwang-Lin) to see how the first element of the smallest set fits into the second- 
smallest set. If the element is in the second-smallest set, we next see whether it 
is in the third-smallest and so on until we determine whether it is in the answer. 
(Admittedly this forces us to make estimates of the set size if we use Hwang-Lin.) 

This method does not increase the work from SvS because we are just moving 
some of the comparisons ahead in the schedule of SvS. The advantage, though, 
is that this action will eliminate arbitrary numbers of elements from sets and 
so will change their relative sizes. Most notably it may change which are the 
smallest two sets, which would appear to be a clear advantage. 

Thus, if we proceed in set-size order, examining the remaining sets, this 
approach has the advantage that the work performed is no larger than SvS, 
and on occasion it might result in savings if another set becomes completely 
eliminated. For example, if we are intersecting the sets A1 = {3, 6, 8}, A2 =
{4, 6, 8, 10} and A3 = {1, 2, 3, 4, 5}, we start by examining A1 and A2, we discard
3 and 4, and identify 6 as a common element. SvS would carry on in these two sets
obtaining the provisional result set R = {6, 8} which would then be intersected
against A3. On the other hand an algorithm that immediately examines the
remaining sets would discover that all elements in A3 are smaller than 6 and
immediately report that the entire intersection is empty.

Another source of improvement from examining the remaining sets once
a common element has been identified is that another set might become the
smallest. For example, let A1 and A2 be as in the previous example and let
A3 = {1, 2, 3, 4, 5, 9}. Then after reaching the common element 6 in A1 and A2
the algorithm examines A3 and eliminates all but {9}. At this stage the remaining
elements to be explored are {8} from A1, {8, 10} from A2 and {9} from A3.
Now the new two smallest sets are A1 and A3 and it proceeds on these two sets.
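
The two examples above can be replayed with a short Python sketch of the idea
behind Small Adaptive. It is a deliberately simplified reading of the
description, not the implementation that was measured: the name small_adaptive
is hypothetical, bisect_left replaces the galloping search, the candidate always
comes from the currently smallest set (the alternation between the two smallest
sets is omitted), and the sets are re-ranked by remaining size before every
candidate.

```python
from bisect import bisect_left


def small_adaptive(sets):
    """Intersect several sets, advancing early to the remaining sets."""
    seqs = [sorted(s) for s in sets]
    if not seqs:
        return []
    pos = [0] * len(seqs)    # first uneliminated position in each set
    out = []

    def remaining(i):
        return len(seqs[i]) - pos[i]

    while all(remaining(i) > 0 for i in range(len(seqs))):
        # Re-rank the sets: eliminations may change which ones are smallest.
        order = sorted(range(len(seqs)), key=remaining)
        first = order[0]
        x = seqs[first][pos[first]]
        pos[first] += 1
        in_all = True
        for i in order[1:]:
            pos[i] = bisect_left(seqs[i], x, pos[i])
            if pos[i] == len(seqs[i]) or seqs[i][pos[i]] != x:
                in_all = False   # x is eliminated; stop looking further
                break
            pos[i] += 1
        if in_all:
            out.append(x)
    return out
```

With A3 = {1, 2, 3, 4, 5} the search for 6 exhausts A3 and the loop stops at
once; with A3 = {1, 2, 3, 4, 5, 9} the same search leaves only {9} in A3, so the
next ranking places A1 and A3 ahead of A2, as in the second example.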

As a final tuning, if the two smallest sets do not change, Small Adaptive
searches alternately between the two sets, as in Two-Smallest Adaptive. Thus
the only difference between these two algorithms is when the two smallest sets
intersect. On the web dataset, this happens so infrequently that the difference
between Small Adaptive and Two-Smallest Adaptive is slight; see Figure 15.
However, when Small Adaptive makes a difference it is usually an improvement,
and there are several instances with a fairly large improvement.
Fig. 15. Ratio of number of comparisons of Two-Smallest Adaptive over Small
Adaptive.

Fig. 16. Ratio of number of comparisons of Adaptive over Small Adaptive.



We have reached the conclusion that four techniques have positive impact: 
galloping, alternating between the two smallest sets, advancing early to addi- 
tional sets when a common element is encountered (a limited form of adap- 
tivity), and updating which sets are smallest. The techniques which had little 
effect, positive or negative, were the Hwang-Lin replacement for galloping, and 
accelerating galloping. The only technique with significant negative impact is 
the full-blown adaptivity based on cycling through all the sets. 

We designed Small Adaptive by starting with SvS and incorporating many
essential features from Adaptive to improve past SvS. In particular, Figures 11,
14, and 15 have shown that Small Adaptive wins over SvS. (This can also be
verified directly.) But how does Small Adaptive compare to our other “extreme,”
the adaptive algorithm from [?]? Surprisingly, Figure 16 shows that Small Adaptive
almost always performs better than Adaptive, regardless of the number of sets
(unlike SvS, which was incomparable with Adaptive).

Table 2 summarizes the algorithms encountered, and a few other possible
combinations. Table 3 shows the average overall running times for these
algorithms, as well as the standard deviations. Interestingly, in this aggregate
metric, Adaptive outperforms SvS; this is because many queries have only 2 or 3
sets. In addition, the algorithm with the smallest average running time is Small
Adaptive. We conclude that Small Adaptive seems like the best general algorithm
for computing set intersections based on these ideas, for this dataset.

6 Conclusion 

In this paper we have measured the performance of an optimally adaptive algo- 
rithm for computing set intersection against the standard SvS algorithm and an 
offline optimal Ideal. From this measurement we observed a class of instances in 
which Adaptive outperforms SvS. The experiments then suggest several avenues 
for improvement, which were tested in an almost orthogonal fashion. From these 
additional results we determined which techniques improve the performance of 
Table 2. Algorithm characteristics key table.

Algorithm                            Cyclic/2   Symmetric   Update     Advance on    Gallop
                                     Smallest               Smallest   Common Elt.   Factor
Adaptive                             Cyclic                 -          -             2
Adaptive 2                           Cyclic     Y           -          -             4
Ideal                                -          -           -          -             -
Small Adaptive                       Two        Y           Y          Y             2
Small Adaptive Accel.                Two        Y           Y          Y             4
Two-Smallest Adaptive                Two        Y           N          N             2
Two-Smallest Adaptive Accel.         Two        Y           N          N             4
Two-Smallest Binary Search (SvS)     Two        N           N          N             -
Two-Smallest Half-Gallop             Two        N           N          N             2
Two-Smallest Half-Gallop Accel.      Two        N           N          N             4
Two-Smallest Hwang-Lin               Two        N           N          N             -
Two-S’est Smart Binary Search        Two        N           Y          N             -
Two-S’est Smart Half-Gallop          Two        N           Y          N             2
Two-S’est Smart Half-Gallop Accel.   Two        N           Y          N             4

Table 3. Aggregate performance of algorithms on web data.

Algorithm                                    Average   Std. Dev.   Min   Max
Adaptive                                     371.46    1029.41     1     21792
Adaptive Accelerated                         386.75    1143.65     1     25528
Ideal                                        75.44     263.37      1     7439
Small Adaptive                               315.10    962.78      1     21246
Small Adaptive Accelerated                   326.58    1057.41     1     24138
Two-Smallest Adaptive                        321.88    998.55      1     22323
Two-Smallest Adaptive Accelerated            343.90    1153.32     1     26487
Two-Smallest Binary Search (SvS)             886.67    4404.36     1     134200
Two-Smallest Half-Gallop                     317.60    989.98      1     21987
Two-Smallest Half-Gallop Accelerated         353.66    1171.77     1     27416
Two-Smallest Hwang-Lin                       365.76    1181.58     1     25880
Two-Smallest Smart Binary Search             891.36    4521.62     1     137876
Two-Smallest Smart Half-Gallop               316.45    988.25      1     21968
Two-Smallest Smart Half-Gallop Accelerated   350.59    1171.43     1     27220


an intersection algorithm for web data. We blended theoretical improvements 
with experimental observations to improve and tune intersection algorithms for 
the proposed domain. In the end we obtained an algorithm that outperforms the 
two existing algorithms in most cases. We conclude that these techniques are of 
practical significance in the domain of web search engines. 
Abstract. Voronoi diagrams of pockets, i.e. polygons with holes, have
a variety of important applications but are particularly challenging to 
compute robustly. We report on an implementation of a simple algorithm 
which does not rely on exact arithmetic to achieve robustness; rather, 
it achieves its robustness through carefully engineered handling of ge- 
ometric predicates. Although we do not give theoretical guarantees for 
robustness or accuracy, the software has sustained extensive experimen- 
tation (on real and simulated data) and day-to-day usage on real-world 
data. The algorithm is shown experimentally to compare favorably in 
running time with prior methods. 



1 Introduction 

The Voronoi diagram (VD) is one of the most well-studied structures in
computational geometry, finding applications in a variety of fields. First
introduced by the Russian mathematician Voronoi in his treatise [?] on the
geometry of numbers, Voronoi diagrams have been studied and generalized in
several directions over the last one hundred years. We refer the reader to [?]
and [?] for detailed surveys of Voronoi diagrams and related structures, with
their applications.

In this paper we consider the Voronoi diagram of a multiply connected poly- 
gon (a pocket), i.e. a polygon with holes, in which the sites are defined by the 
(reflex) vertices and (open) edges on the boundary of the polygon. The Voronoi 
diagram is defined as the locus of points that have two or more nearest sites. A 
Voronoi diagram divides the polygon into regions associated with the sites, such 
that all points in a region have a common nearest site. 

Computing the Voronoi diagram of a polygon without using exact arithmetic,
or at least “exact geometric computation” (EGC), is a very difficult task. The
difficulty lies in the floating point round-off errors, which are formidable due
to the high arithmetic degree² of our problem (estimated to be 40 [2]). These
round-off errors can accumulate and lead to inconsistencies and failure.

¹ See http://www.ams.sunysb.edu/~saurabh/pvd for latest on pvd.

In this paper we give an algorithm to compute the Voronoi diagram of pockets
in the plane. Our algorithm accepts floating point input and performs only
floating point operations. We have implemented the algorithm in C++ and,
although we have not yet been able to show that it is provably robust, it is shown
to be robust for all practical purposes. Even after testing on thousands of random
and real-world inputs, we have not been able to break our code. The software
has been named “pvd” (“pocket Voronoi diagram”).

Software that uses exact arithmetic gives exact results. However, there is a
cost to pay for using exact arithmetic. We have designed our software to save
this cost by not using exact arithmetic; as a consequence, the output diagram
may not be exactly correct. There is a trade-off here. A comparative study of the
two approaches helps us to estimate the cost of using exact arithmetic. If, for a
particular application, efficiency is crucial and some amount of error is
permissible, our software may be a good solution.

Computing offsets is one such application. An offset of a polygon is a polygon 
obtained by uniformly “shrinking” (or “growing”) the polygon. More formally, it 
is obtained from a Minkowski sum of the boundary of the polygon with a circular 
disk. Offsets of multiply connected polygons are used in numerically controlled 
(NC) contour milling. In NC milling, given a multiply connected polygon and a 
rotating tool, the goal is to “mill” (cover) the polygon using the tool. One of the 
approaches to do this is to compute successive offsets of the multiply connected 
polygon and then to “stitch” together these offsets (contours) in order to obtain 
a tool path. A thorough study of VD-based milling was done by Held [7]. He
also studies various practical issues involving milling in general.

We have used pvd to compute offsets. In fact, our offset-finding routines form
the core of the NC milling software “EZ-Mill” developed by Bridgeport Machines,
where pvd has been in commercial use for the last two years.



1.1 Related Work 

There are several algorithms for computing the Voronoi diagram of a simple
polygon. The first O(n lg n) algorithm was given by Lee [?], who built on the
earlier work of Preparata [?]. The problem of computing the VD of a set of
disjoint polygons and circular objects was first studied by Drysdale in his Ph.D.
thesis [?]. He achieved a sub-quadratic solution, which was subsequently
improved by Lee and Drysdale to O(n lg² n) time [14]. Subsuming earlier work on
computing the VD of a set of disjoint line segments and circular arcs, Yap came
up with a sophisticated worst-case optimal O(n lg n) algorithm [23]. For the
special case of a simple polygon (without holes), a sophisticated linear time
algorithm to compute the VD was given by Chin et al. [3].



² An algorithm has degree d if its tests involve polynomials of degree ≤ d [?].



Pocket Voronoi Diagrams 107 



Several attempts have been made previously to implement the Voronoi diagram
of a multiply connected or simple polygon or a set of line segments and
points. In the first systematic study, Held [?] implemented a divide-and-conquer
algorithm similar to Lee’s as well as an algorithm based on the wave propagation
approach outlined by Persson [?]. Although his implementation is very
efficient, it is not robust and can crash. Another available program (pivor) is by
Imai [?], who implemented an incremental algorithm for a disjoint set of line
segments and points. His method guarantees “topological correctness”. While his
algorithm is worst-case cubic, its performance in practice is substantially better.
His implementation gives no guarantee on the accuracy of the VD. Further, our
tests of his code have led to unacceptably incorrect outputs in several cases. It
is coded in Fortran and does not accept arbitrary floating point input, only
decimal inputs with up to 6 decimal places. This is unacceptable for many
applications. Another code (avd), written by M. Seel, is based on exact arithmetic
in LEDA [?]. This code gives the exact Voronoi diagram for any input. However,
the use of exact arithmetic comes with a considerable penalty in the running
time.

Very recently, the second author of this paper has developed another code
(vroni [?]), in the time since pvd was first developed and deployed in practice.
Vroni uses techniques different from pvd to achieve its robustness and
performance, though it also abides by the use of floating point arithmetic and
carefully engineered robustness.

Pivor, pvd, and vroni do not guarantee accuracy of the VD. They only guarantee
a form of “topological” correctness, which means that each site has a connected
Voronoi region and the Voronoi regions of a segment and its end-points
are adjacent. From a theoretical perspective this is not much of a guarantee.
However, these programs may perform well in practice and give practical
solutions to real-world problems. The major strength of avd is that it does
guarantee the exact computation of the correct VD. However, using software based
on exact arithmetic may be prohibitive for some applications. A comparative study
of software that uses exact arithmetic with software that uses only floating point
computation helps us to understand the tradeoffs.

Also related to our work is the implementation by Sugihara and Iri for
computing the VD of a million points [?] and the implementation by Hoff et al. that
computes the VD of segments and curves in two and three dimensions using graphics
hardware [?]. Also of related interest is the work by Liotta et al. on proximity
queries in 2D using graphics hardware [?].

2 Simple Polygons 

The Voronoi diagram of a simple polygon, with sites corresponding to the (open) 
boundary segments and the reflex vertices, has a very simple structure: it is a tree 
with some of the leaves “touching” each other at reflex vertices of the polygon. 
The leaf edges of the Voronoi tree are determined by consecutive sites on the 
boundary of the input polygon. If we ignore the geometry of the Voronoi diagram 
and only look at the underlying graph, we get what is called the Voronoi graph.
The dual of this graph is the Delaunay graph, which, for non-degenerate inputs, 
is a Delaunay triangulation. Degenerate inputs possibly lead to cycles of size 
larger than three; we handle such cases by arbitrarily triangulating such cycles, 
which is equivalent to allowing zero-length Voronoi edges in the Voronoi diagram 
(so that the degrees of all Voronoi vertices are three). 




Fig. 1. Computing Voronoi Diagram and Delaunay Triangulation. 



We know that there exists a Voronoi edge and hence a Delaunay edge between 
consecutive sites on the boundary of the input polygon. Thus, the outer cycle 
of the Delaunay triangulation is simply the cycle of consecutive sites of the 
input polygon. Our algorithm is based on simple “ear clipping”: it begins with
this cycle and incrementally cuts “Delaunay ears” in order to obtain the final 
Delaunay triangulation. Here, an ear is a triangle two of whose sides are edges 
of the current cycle; a Delaunay ear is an ear such that the three sites involved 
determine an empty circle (a disk having no portion of the polygon boundary in 
its interior) touching them. 

2.1 Delaunay Ear Cutting 

Given a cycle of Delaunay edges, finding a Delaunay ear to cut is the central 
task in our algorithm. For a cycle of length n there are n possible ears, each of 
which can be tested for Delaunayhood by checking the empty circle property. 
Since any triangulation has at least two ears, we should, in theory, always be 
able to identify a Delaunay ear. Numerical inaccuracies, however, can result in 
a situation in which we do not find any Delaunay ear. In such cases we go into a 
“relaxed Delaunayhood” mode, described later. We do not need to check a linear 
number of ears every time we want to find a Delaunay ear, since every time we 
cut an ear, only the two neighboring candidate ears are affected. 

2.2 Delaunay Ear Test 

To check if an ear is Delaunay we first find the circle defined by the two edges 
of the cycle. The two Delaunay edges of the ear each correspond to a bisector 
between two sites. The first intersection of these bisectors gives the center of
a circle touching all three sites of the ear. The ear is Delaunay if this circle is 
empty. 

We find the center and radius of this circle and check if it is empty by
comparing its radius with the distance from its center to each of the other sites
of the cycle. Thus, in the worst case, it can take linear time to check for
Delaunayhood of an ear, leading to an overall worst-case quadratic time
algorithm. In pvd, however, we utilize basic grid hashing so that we obtain, in
practice, closer to linear time (or O(n log n) time) performance of our algorithm
for most practical inputs. For polygons having a high degree of cocircularity
(such as a circle approximated by numerous line segments), our algorithm does
exhibit quadratic-time behavior. Future improvements planned for pvd will
address this weakness.
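
The linear-time form of the emptiness test can be sketched in Python as follows.
This is an illustrative assumption about how such a check might look, not the
pvd code: every site is represented as a pair of endpoints (a point site is a
segment whose endpoints coincide), the function names are hypothetical, and the
grid hashing mentioned above is omitted, so all other sites are scanned.

```python
import math


def dist_point_segment(p, a, b):
    """Euclidean distance from point p to the closed segment ab."""
    ax, ay = a
    bx, by = b
    px, py = p
    dx, dy = bx - ax, by - ay
    length_sq = dx * dx + dy * dy
    if length_sq == 0.0:
        return math.hypot(px - ax, py - ay)      # degenerate segment: a point
    # Parameter of the orthogonal projection, clamped to the segment.
    t = ((px - ax) * dx + (py - ay) * dy) / length_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def circle_is_empty(center, radius, other_sites):
    """True if no other site comes strictly inside the circle."""
    return all(dist_point_segment(center, a, b) >= radius
               for a, b in other_sites)
```

A grid-based variant would apply the same distance comparison only to the sites
stored in the grid cells that the circle overlaps.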

2.3 Finding the Ear Circle 

Finding the center of the circle corresponding to an ear is the only non-trivial
task in our algorithm. Once we find the center of the circle, we take as the
radius of the ear circle the minimum of the distances from this center to the
three corresponding sites. (We have found that taking the minimum reduces
the number of times we go into the relaxed Delaunayhood mode.) Also, we
discard all circles that do not lie inside the bounding box of the input polygon
as non-Delaunay. This is justified, since any Delaunay circle must lie inside the
input polygon. This step not only improves the efficiency but also is vital for
robustness. If we do not do this we can run into robustness problems with a circle
whose center lies far away from the input polygon, resulting in the distances from
the center to any site of the polygon appearing to be nearly the same, up to
floating point precision.

In order to find the center of the ear circle, we need to intersect the two 
bisectors corresponding to the two Delaunay edges of the ear. We consider each 
Delaunay edge on the cycle to be oriented in the counter-clockwise direction 
of the cycle. A bisector corresponding to a Delaunay edge can be either a line 
bisector or a parabolic bisector, depending on its source and target sites. The 
source and target sites can be line segments or reflex vertices. A segment site 
may or may not have its endpoints as point sites. 

We deal with several different types of ears in the code depending on the 
types of the three sites comprising the ear. However, there are essentially only 
three cases: ears with two line bisectors, ears with a line and a parabolic bisector, 
and ears with two parabolic bisectors. We describe our algorithm for each of the 
three cases. The primitives we use to implement these cases are described later. 

For each of the cases we call the first site of the ear in counter-clockwise
order p1 or s1 depending on whether it is a point or a segment. Similarly, we
call the second and third sites p2 or s2 and p3 or s3, respectively. We call the
first bisector, i.e. the bisector between the first and second sites, b1 and the
second bisector, i.e. the bisector between the second and third sites, b2. Also, we
consider the bisectors to be directed away from the boundary of the polygon.
We refer to a point on a bisector as “before” or “after” another point on the
same bisector according to the order along the directed bisector. We assume a
counter-clockwise orientation for the polygon. Thus the region of interest (the
interior) of the pocket is always to the left of any segment.

There are certain conditions common to all three cases that must be sat- 
isfied. While these conditions are naturally satisfied for an exactly computed 
Delaunay graph, it is important to the robustness of our implementation that 
we specifically impose each of these conditions. 

1. The first and third sites should be eligible to be adjacent in the Delaunay 
graph, where the eligibility criteria are: 

— Two point sites are always considered to be adjacent. 

— A point site can be adjacent to a segment site only if it lies on the left 
of the directed line containing the (directed) segment. 

— A segment site can be adjacent to another segment site if at least one of 
the ends of one segment is on the left of the directed line containing the 
other segment, and vice versa. See Figure 2.



Fig. 2. For two segments to be eligible to be adjacent each should have at least one
endpoint on the left side of the other. (a) Segments are eligible to be adjacent;
(b) segments are ineligible to be adjacent.

2. The second bisector must cross the first bisector from its right to its left side. 

3. If the first site is a segment, s1, then along b1, the center of the ear circle
must lie before the point where b1 intersects the perpendicular to s1 at its
source endpoint. See Figure 3.

4. If the third site is a segment, s3, then along b2, the center of the ear circle
must lie before the point where b2 intersects the perpendicular to s3 at its
target endpoint. See Figure 3.

5. The center of the ear circle must lie inside the bounding box of the input 
polygon. 
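
The eligibility criteria of condition 1 translate directly into code. The Python
sketch below is only an illustration: the tuple encoding of sites and the names
is_left and eligible are assumptions, and the left-of test uses the ordinary
signed-area determinant for brevity, whereas the text later explains that pvd
prefers a pseudoangle-based test for robustness.

```python
def is_left(a, b, p):
    """True if p lies strictly to the left of the directed line a -> b."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0]) > 0


def eligible(site1, site2):
    """Eligibility of two sites to be adjacent in the Delaunay graph.

    A site is ("point", p) or ("segment", a, b), with segments directed
    from a to b (encoding assumed for this example).
    """
    if site1[0] == "point" and site2[0] == "point":
        return True
    if site1[0] == "point":
        return is_left(site2[1], site2[2], site1[1])
    if site2[0] == "point":
        return is_left(site1[1], site1[2], site2[1])
    # Two segments: each needs at least one endpoint left of the other.
    _, a1, b1 = site1
    _, a2, b2 = site2
    return ((is_left(a2, b2, a1) or is_left(a2, b2, b1)) and
            (is_left(a1, b1, a2) or is_left(a1, b1, b2)))
```

For two horizontal segments facing each other across the interior the test
succeeds, while a segment lying entirely to the right of the other is rejected.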

Two line bisectors. This is a simple case in which the two bisectors intersect in 
one point or they do not intersect. If the necessary conditions above are satisfied, 
we get a valid ear circle. 

One line and one parabolic bisector. In this case, the two bisectors can have 
zero, one or two intersection points. However, at most one intersection point can 
satisfy the first necessary condition. See Figure 4.

Fig. 3. Illustration of necessary conditions 3 and 4.

There is an additional condition that needs to be checked in this case. Note
that if there are two intersection points, these two points divide each bisector
into three parts. We will call them the first, second and third part, respectively.
Now consider the previous Delaunay circle for b1. It has to be empty; hence,
its center must lie on the side of b2 on which the second site lies. On that side,
depending on the types of sites, either we would have the second part of b1 or
we would have the first and third part of b1. In the latter case we need to check
if it is on the first or the third part, since, if it is on its third part, it means
that we are moving away from the intersection points and therefore none of the
intersection points are valid. An analogous check is also made for b2.




Fig. 4. One line and one parabolic bisector. 



Two parabolic bisectors. In this case we compute the line bisector between the 
first and third site and find the intersection between this line bisector and the 
thinner of the two parabolic bisectors, thereby reducing this case to the case of 
one line and one parabolic bisector. 

2.4 Geometric Primitives 

Sidedness primitive. A standard primitive is to decide whether a point C lies to
the left of, to the right of, or on the line defined by A and B. Most often, one
does this by testing the sign of a determinant (the signed area of the
parallelogram). We have found, however, that we are better off for robustness in
using a pseudoangle based on slope, and then deciding sidedness based on the
pseudoangles. In particular, we use pseudoangles based on partitioning 2π into 8
sectors of π/4 radians each, and mapping an angle θ ∈ (0, π/4) to a pseudoangle
(slope) m ∈ (0, 1). In this way, each angle θ gets mapped to a pseudoangle
between 0 and 8.

Although less efficient than the standard signed area test, this test is much 
more robust for our purposes. The signed area can lead to trouble for cases when 
the area is very small. For example, suppose points A and B are extremely close
to each other in comparison to distances to point C. In this case, irrespective of
where point C is, the area is very small. Hence, for different values of C it can 
lead to conclusions that are not consistent with each other. Another example 
is when a slowly turning curve is finely approximated by line segments. In this 
case, any two consecutive segments appear to be collinear to a primitive based 
on area computation; then, a chain of such observations may ultimately lead to 
incorrect inferences. 
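
One possible realization of such a pseudoangle is sketched below in Python. The
sector formulas and the way sidedness is read off from the difference of two
pseudoangles are assumptions made for this example; the text does not spell out
the exact construction used in pvd, and the exact comparisons with 0 and 4 here
are a naive stand-in for whatever tolerance handling a robust implementation
would use.

```python
def pseudoangle(dx, dy):
    """Map the direction (dx, dy) to a value between 0 and 8.

    Each unit corresponds to one sector of pi/4 radians; within a sector the
    value is a slope-like ratio, so no trigonometric function is evaluated.
    """
    if dx == 0 and dy == 0:
        raise ValueError("the zero vector has no direction")
    if dy < 0:
        return 4.0 + pseudoangle(-dx, -dy)       # lower half-plane
    if dx >= 0:
        if dx >= dy:
            return dy / dx                       # sector 0: [0, 1]
        return 2.0 - dx / dy                     # sector 1: (1, 2]
    if -dx <= dy:
        return 2.0 + (-dx) / dy                  # sector 2: (2, 3]
    return 4.0 - dy / (-dx)                      # sector 3: (3, 4]


def side(a, b, c):
    """Classify c as "left", "right" or "on" relative to the line a -> b."""
    turn = (pseudoangle(c[0] - a[0], c[1] - a[1])
            - pseudoangle(b[0] - a[0], b[1] - a[1])) % 8.0
    if turn == 0.0 or turn == 4.0:
        return "on"
    return "left" if turn < 4.0 else "right"
```

The mapping is monotone in the true angle and sends opposite directions to
values that differ by 4, which is what allows the left/right decision to be made
from the difference alone.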

Intersection of two lines. We find the intersection of two lines by solving the 
equations to find one of the coordinates and then using the equation of the first 
line to get the other coordinate. This ensures that even if we have numerical 
errors the intersection point lies somewhere on the first line. It is also crucial
for robustness to choose which coordinate to compute first. We choose the one 
which avoids division by a small number. 
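
A minimal Python sketch of this primitive follows. The representation of a line
as coefficients (a, b, c) of a·x + b·y = c, the name intersect_lines, and the
specific rule for choosing the first coordinate (back-substitute through the
larger of the first line's two coefficients) are assumptions for illustration;
the text only states that the choice should avoid division by a small number.

```python
def intersect_lines(line1, line2):
    """Intersect two lines given as (a, b, c) with a*x + b*y = c.

    One coordinate is obtained by elimination; the other is recovered from
    the equation of the first line, so the result satisfies that equation up
    to the rounding of a single division. Returns None for parallel lines.
    """
    a1, b1, c1 = line1
    a2, b2, c2 = line2
    det = a1 * b2 - a2 * b1
    if det == 0.0:
        return None
    if abs(b1) >= abs(a1):
        # Solve for x first, then divide by the larger coefficient b1.
        x = (c1 * b2 - c2 * b1) / det
        y = (c1 - a1 * x) / b1
    else:
        # Solve for y first, then divide by the larger coefficient a1.
        y = (a1 * c2 - a2 * c1) / det
        x = (c1 - b1 * y) / a1
    return (x, y)
```

For instance, intersecting x + y = 2 with x - y = 0 takes the first branch and
yields the point (1, 1).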

Bisector of two lines. For our purposes, in most cases when we want to find the 
bisector between two lines, we already know one point on the bisector. Then, we 
first determine the slope of the bisector, which can always be determined robustly 
irrespective of the relative slopes of the two lines. This slope and the already 
known point determine the bisector. In cases in which we do not already know a 
point on the bisector, we calculate the bisector using elementary geometry. The 
equation of the bisector can be obtained from the equations of the two lines, 
using only additions, multiplications and square roots. 



2.5 Relaxed Delaunayhood 

As mentioned in Section 2.1, due to numerical imprecision our algorithm can
get into a situation in which there are no Delaunay ears to cut while the
triangulation is not yet complete. In such cases, our algorithm relaxes the
condition for Delaunayhood, as follows. We now consider an ear to be “Delaunay”
if it is approximately Delaunay: the circle corresponding to the ear is empty
after its radius is reduced by multiplying it by (1 - f) (while maintaining the
same center point), where f is a fraction called the shrink ratio. Initially, f is
taken to be the smallest floating point number such that (1 - f) < 1; it is then
multiplied by 2 each time the algorithm cannot find a Delaunay ear to cut, until
it succeeds, after which it is reset.

If the algorithm never goes into the relaxed Delaunayhood mode, it computes 
a Voronoi diagram that is accurate up to floating point round-off error. In the 
relaxed mode, the shrink ratio determines the measure of accuracy of the output. 
In fact, if / ever goes above 2“^, we report an error. However, this has never 
happened in all the experiments we have conducted and all of the real-world use 
the algorithm has had to date. For most inputs the algorithm does not go into 
relaxed mode. For a few inputs, it enters the relaxed mode once, but it is rare 
to enter more than once. 
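
The doubling scheme for the shrink ratio can be sketched in Python as below. This is a simplified illustration, not the pvd implementation: candidate ears are reduced to their circles, sites are points only, the starting value 2^-53 is one double-precision number for which 1 minus the ratio is less than 1 (not necessarily the smallest), and `max_ratio` is a hypothetical parameter standing in for the error threshold.

```python
import math

def find_relaxed_ear(circles, sites, max_ratio):
    """circles: list of (center, radius), one per candidate ear.

    Returns (index, ratio) of the first ear whose circle is empty, trying
    the strict test first and then doubling the shrink ratio.
    """
    def is_empty(center, radius, ratio):
        shrunk = radius * (1.0 - ratio)  # same center, reduced radius
        return all(math.dist(center, s) >= shrunk for s in sites)

    for i, (center, radius) in enumerate(circles):
        if is_empty(center, radius, 0.0):
            return i, 0.0  # a true Delaunay ear exists; no relaxation
    ratio = 2.0 ** -53  # assumed starting value: 1.0 - ratio < 1.0 holds
    while ratio <= max_ratio:
        for i, (center, radius) in enumerate(circles):
            if is_empty(center, radius, ratio):
                return i, ratio
        ratio *= 2.0
    raise RuntimeError("shrink ratio exceeded the allowed maximum")
```

The caller would reset the ratio after each successful cut simply by calling the function again, since the search always starts from the strict test.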

3 Polygons with Holes

Our algorithm requires a Hamiltonian cycle of Delaunay edges at its initializa- 
tion, after which it performs Delaunay ear clipping to determine the Delaunay 
triangulation. For a simple polygon, the Hamiltonian cycle is readily defined by 
the sequence of sites on the polygon, in order around its boundary. For a poly- 
gon with holes, the boundary consists of many such cycles; we need to link these 
cycles using some bridging Delaunay edges and then double them in order to ob- 
tain a single cycle. Thus we need to compute a “Delaunay spanning tree” (DST) 
of contours, which interconnects the contours into a tree using Delaunay edges. 




Fig. 5. Finding a Delaunay spanning tree (DST) for a pocket. 



We find a DST of contours by incrementally finding Delaunay edges that 
bridge between connected components of contours. At any given stage, we at- 
tempt to find a bridge (Delaunay edge) linking an “island” contour to the set 
of contours that have already been connected by bridges to the outer boundary. 
We omit details from this abstract. 

In the worst case, this bridging algorithm requires O(kn^) time. However, 
the use of simple bucketing makes it O(kn) time in practice. For small k, this is
efficient enough in practice; for large k, it becomes a bottleneck in our method,
making it less effective for pockets having a large number of holes. Of course,
in theory, we can implement more efficient methods (e.g., O(n log n)); however,
our objective was to keep the algorithm and its implementation very simple. We 
expect to improve the efficiency of the hole bridging in a future release of the 
code. 

4 Efficiency Measures 

Our algorithm to compute the VD of a simple polygon is worst-case quadratic,
since we do O(n) empty circle tests, each of which may take O(n) time. In contrast,
there is a theoretically achievable linear time bound (which relies on linear-time
triangulation, a result having no corresponding implementation). However,
we achieve a running time much closer to linear (or O(n log n)) in practice
by using simple grid hashing to make our empty circle test perform in constant
time in practice. We use a grid of size $O(\sqrt{n}) \times O(\sqrt{n})$, resulting in O(n)
rectangular buckets. A site is associated with a bucket if it intersects the bucket.
Thus, a very long segment-site can lie in many buckets; a point-site will always
lie in a single bucket. When performing an empty circle test, we examine only
those sites that are associated with buckets that intersect the bounding box of
the query circle.
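
A minimal Python sketch of such a grid is given below. It handles point-sites only (segment-sites, which may span several buckets, are left out), and the class name, the uniform cell size and the clamping of out-of-range queries to border cells are assumptions made for the illustration.

```python
import math
from collections import defaultdict

class PointGrid:
    """About sqrt(n) x sqrt(n) buckets over the bounding box of n point-sites."""

    def __init__(self, points):
        self.points = list(points)  # assumed non-empty
        self.side = max(1, math.isqrt(len(self.points)))
        xs = [p[0] for p in self.points]
        ys = [p[1] for p in self.points]
        self.x0, self.y0 = min(xs), min(ys)
        # Fall back to a unit cell when all sites share a coordinate.
        self.w = (max(xs) - self.x0) / self.side or 1.0
        self.h = (max(ys) - self.y0) / self.side or 1.0
        self.cells = defaultdict(list)
        for p in self.points:
            self.cells[self._cell(p[0], p[1])].append(p)

    def _cell(self, x, y):
        # Clamp so that queries outside the grid map to border cells.
        i = min(self.side - 1, max(0, int((x - self.x0) / self.w)))
        j = min(self.side - 1, max(0, int((y - self.y0) / self.h)))
        return i, j

    def is_empty(self, center, radius):
        """True if no site lies strictly inside the query circle."""
        i0, j0 = self._cell(center[0] - radius, center[1] - radius)
        i1, j1 = self._cell(center[0] + radius, center[1] + radius)
        for i in range(i0, i1 + 1):
            for j in range(j0, j1 + 1):
                for p in self.cells.get((i, j), ()):
                    if math.dist(center, p) < radius:
                        return False
        return True
```

Only the buckets overlapping the bounding box of the circle are scanned, so a small query circle touches a constant number of buckets when the sites are spread evenly; sites crowded on a smooth curve defeat this, in line with the weakness reported in the experiments.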

5 Implementation 

Pvd consists of about 8000 lines of C++ code and is available by request for 
research purposes, without fee. It is organized in the form of a library with a 
clean and simple interface to enable easy inclusion in client software. It has been 
compiled and tested on Unix using g++ and on Windows95 using Visual C++. 
We are in the process of testing it on other platforms. 

6 Experimental Results 

Pvd has sustained tests and real-world usage on hundreds of data sets obtained 
from boundaries of work-pieces for NC machining and stereolithography. The 
most important characteristic of pvd for our motivating applications is its ro- 
bustness; it has always succeeded in computing the Voronoi diagram (or a close 
approximation thereto) for pockets. (The only exception to this are some cases 
in which the number of holes is extremely large, causing our current implemen- 
tation of hole bridging to run out of memory.) The Voronoi diagrams that it 
computes are used every day by engineers and designers for computing offsets 
and automatic generation of tool paths. 

Another important characteristic of pvd is its efficiency in practice. As part 
of our timing studies, we have conducted experiments on about 800 data sets 
created synthetically by various means, including the Random Polygon Generator
(RPG) developed by Auer and Held. Our experiments were carried out
on a Sun Ultra 30, running Solaris 2.6 on a 296 MHz processor with 384 MB of 
main memory. All cpu times are given in milliseconds. 

Table 1 shows the average time consumed by each code for random inputs
of different sizes. Each entry is the average of cpu times (per segment) for 10
random inputs of the same size. The random inputs were generated by RPG.
“n/a” means that the software did not finish in a reasonable amount of time. Pvd
is seen to perform substantially faster than the prior methods (avd and pivor),
while being somewhat inferior to the newest method of Held that was just
released (vroni).

The main weakness we have observed in pvd is its inefficiency in the cases
of (1) numerous small segments that are nearly cocircular or that lie on a smooth
curve (e.g., an ellipse), since in this case our simple grid hashing is much less
effective in empty circle queries, and pvd exhibits quadratic behavior; and (2) pockets
having numerous holes, since our current implementation handles these inefficiently,
using O(kn) time and space for k holes. (In particular, this can
mean that we run out of memory for huge k.) The newly released vroni software
does not seem to suffer from these weaknesses; see Held.

Table 1. CPU time (per segment) for the different Voronoi codes applied to the “RPG”
polygons. The “n/a” entries indicate that no result was obtained within a reasonable
period of time.

                        RPG
   size       avd     pivor       pvd     vroni
     64     50.55     1.359     0.164     0.109
    128     75.26     0.922     0.160     0.107
    256     119.8     0.657     0.162     0.108
    512     209.6     0.565     0.158     0.115
   1024     404.4     0.517     0.167     0.114
   2048     758.9     0.511     0.167     0.117
   4096       n/a     0.561     0.175     0.126
   8192       n/a     0.643     0.185     0.136
  16384       n/a     0.898     0.209     0.148
  32768       n/a     1.293     0.234     0.156

7 Conclusion and Future Directions 

We have reported on a system, pvd, which implements a simple and robust algorithm
for computing the Voronoi diagram of polygonal pockets. The algorithm
is carefully engineered to be robust in the face of floating point errors, without
resorting to exact arithmetic. The overall algorithm is simple and all the crucial
techniques that ensure robustness are restricted to a small number of primitives.
The implementation is very practical from the efficiency point of view, except
possibly in the case of a very large number of holes or “smooth” polygons. While
we have not yet shown theoretical guarantees for our algorithm, our implementation
has been extensively tested and has been in use for over two years now. We are also
optimistic that one can in fact prove the algorithm to be robust and accurate,
up to errors introduced by machine precision. More details on pvd will appear
in [20].

Finally, we refer the reader to Held for detailed information on the most
recent system, vroni, which is based on an alternative incremental method of 
computing Voronoi diagrams robustly. Vroni has been shown to compare favor- 
ably with pvd in robustness, speed, and generality. 

Abstract. We focus on the problem of experimentally evaluating the
quality of hierarchical decompositions of trees with respect to criteria 
relevant in graph drawing applications. We suggest a new family of tree 
clustering algorithms based on the notion of t-divider and we empirically 
show the relevance of this concept as a generalization of the ideas of 
centroid and separator. We compare the t-divider based algorithms with 
two well-known clustering strategies suitably adapted to work on trees. 
The experiments analyze how the performances of the algorithms are 
affected by structural properties of the input tree, such as degree and 
balancing, and give insight in the choice of the algorithm to be used on 
a given tree instance. 



1 Introduction 

Readable drawings of graphs are a helpful visual support in many applications, 
as they usually convey information about the structure and the properties of a 
graph more quickly and effectively than a textual description. In recent years this
has led to the design of several algorithms for producing aesthetically pleasing
layouts of graphs and networks, according to different drawing conventions and
aesthetic criteria [3]. As the amount of data processed by real applications increases,
many realistic problem instances consist of very large graphs and suitable
tools for visualizing them are more and more necessary. Actually, standard visu- 
alization algorithms and techniques do not scale up well, not only for efficiency 
reasons, but mostly due to the finite resolution of the screen, which represents a 
physical limitation on the size of the graphs that can be successfully displayed. 

Common approaches to overcoming this problem, and to making it possible to
visualize graphs that do not fit on the screen, exploit clustering information
and/or navigation strategies. In particular, the clustering approach consists
of using recursion combined with graph decomposition in order to build
a sequence of meta-graphs from the original graph G: informally speaking, a 
meta-graph in the sequence is a “summary” of the previous one; the first and 
more detailed meta-graph coincides with G itself. To grow a meta-graph, suit- 
ably chosen sets of related vertices are grouped together to form a cluster and 
induced edges between clusters are computed. Clusters are then represented as 

* Work partially supported by the project “Algorithms for Large Data Sets: Science 
and Engineering” of the Italian Ministry of University and of Scientific and Techno- 
logical Research (MURST 40%). 



A.L. Buchsbaum and J. Snoeyink (Eds.); ALENEX 2001, LNCS 2153, pp. 117-UJiil 2001. 
(c) Springer- Verlag Berlin Heidelberg 2001 






I. Finocchi and R. Petreschi 



single vertices or filled-in regions, thus obtaining a high-level visualization of the 
graph while reducing the amount of displayed information by hiding irrelevant 
details and uninteresting parts of the structure. There is no universally accepted 
definition of cluster: usual trends consist of grouping vertices which either share 
semantic information (e.g., the space of IP addresses in a telecommunication 
network) or reflect structural properties of the original graph [1 

The sequence of meta-graphs defines a hierarchical decomposition of the original
graph G which can be well represented by means of a rooted tree, known
in the literature as a hierarchy tree, whose leaves are the vertices of G and whose
internal nodes are the clusters. In order to grow a hierarchy tree out of a graph
G, a simple top-down strategy works as follows: starting from the root of the
hierarchy tree, which is a high-level representation of the whole graph, at the
first iteration the vertices of G are partitioned by a clustering algorithm and the
clusters that become the children of the root (at least 2) are generated. The procedure
is then recursively applied to all these clusters. Clearly, very different hierarchy
trees can be associated with the same graph, depending on the clustering
algorithm used as a subroutine. Suitable selections of nodes of a hierarchy tree
lead to different high-level representations (views) of G: the user of the visualiza- 
tion facility can then navigate from a representation to another one by shrinking 
and expanding clusters in the current view. Due to the fact that hierarchy trees 
are interactively visited, they should fulfil some requirements that are relevant 
from a graph drawing perspective. First of all, any view obtained from a hierar- 
chy tree of a graph G should reflect the topological structure of G so as not to 
mislead the viewer. Furthermore, structural properties such as limited degree, 
small depth and balancing of the hierarchy tree greatly help in preserving the 
mental map of the viewer during the navigation and make it possible to provide 
him/her with several different views of the same graph. 

The main goal of this paper is to experimentally evaluate with respect to the 
aforementioned criteria the quality of hierarchy trees built using different cluster- 
ing subroutines. More precisely, in our study we are concerned with hierarchical 
decompositions of trees, not only because such structures frequently arise in 
many practical problems (e.g., evolutionary or parse trees), but also because 
tree clustering procedures represent an important subroutine for partitioning 
generic graphs (for example, they could be applied to the block-cut-vertex tree 
of a graph). As a first contribution, we suggest new tree clustering algorithms 
hinging upon the notion of t-divider, which generalizes concepts like centroid and
separator, and we present three partitioning strategies, each aimed at
optimizing different features of the hierarchy tree. We then explore the effectiveness
of the t-divider based algorithms for different values of t by comparing
them against two well-known clustering procedures due to Gonzalez and
Frederickson, suitably adapted to work on general trees.

The quality measures considered in the experimentation are defined in agree- 
ment with the requirements on the structure of the hierarchy tree arising in 
graph drawing applications. All the experiments are carried out on structured 
randomly generated instances, which allow us to study how the performances 
of the algorithms are affected by parameters such as degree and balancing of 
the input tree. In particular, looking at the experimental results, we can detect 
which algorithms are more sensitive to the structure of the input instance and 
identify hard and easy problem families for some of the analyzed algorithms. The
results give insight into the choice of the algorithm to be used when a particular
tree instance is given. Furthermore, they provide experimental evidence for the
utility of the theoretical notion of t-divider, showing that the use of centroids,
which is very frequent in the literature, does not usually yield the best solution.

Fig. 1. (a) A graph G; (b) a hierarchy tree of G and a covering C={3,d,b,g,12,f} on
it; (c) view of G induced by covering C.

In order to demonstrate the feasibility of the clustering approach for the visualization
of large trees, we have also been developing a prototype visualization facility
which implements all the algorithms discussed in the paper and supports the
navigation of the hierarchy tree as described above. The use of such a prototype
gave us visual feedback on the behaviour of the algorithms and was helpful for
tuning the code and for testing heuristics to design more efficient variations. The
prototype is implemented in ANSI C and visualizations are realized in the algorithm
animation system Leonardo, available for the Macintosh platform at the
URL http://www.dis.uniroma1.it/~demetres/Leonardo/. The tree clustering
package is available at the URL
http://www.dsi.uniroma1.it/~finocchi/experim/treeclust-2.0/.

In the rest of this paper we first recall preliminary definitions and concepts related
to hierarchy trees in Section 2. In Section 3, after introducing the concept of
t-divider of a tree, we design three strategies for partitioning trees based on such
a concept. Gonzalez’s and Frederickson’s clustering approaches are summarized
in Section 4. Section 5 describes the experimental framework and the results of
the experimentation are discussed in Section 6. Conclusions and directions for
further research are finally addressed in Section 7.

2 “Attractive” Hierarchy Trees 

In this section we recall the definition of hierarchy tree and we discuss the concepts
of covering and of view of a graph on a hierarchy tree associated with it [2].
We then point out which properties make a hierarchy tree attractive from a 
graph visualization point of view, i.e., which requirements should be fulfilled by 
a hierarchy tree so that graph drawing applications can substantially benefit 
from its usage. 

Definition 1. A hierarchy tree HT = (N,A) associated to a graph G = (V,E)
is a rooted tree whose set of leaves coincides with the set of vertices of G.

Each node $c \in N$ represents a cluster of vertices of G, that we call vertices
covered by c. Namely, each leaf in HT covers a single vertex of G and each
internal node c covers all the vertices covered by its children, i.e., all the leaves
in the subtree rooted at c. For brevity, we write $u \prec c$ to indicate that a vertex
$u \in V$ is covered by a cluster $c \in N$. The cardinality of a cluster c is the number
of vertices covered by c; moreover, for any $c \in N$, we denote by S(c) the subgraph
of G induced by the vertices covered by c. Two clusters c and c' which are not
ancestors of each other are connected by a link if there exists at least one edge
$e = (u, v) \in E$ such that $u \prec c$ and $v \prec c'$ in HT; if more than one edge of this
kind exists, we consider only a single link. We denote by L the set of all such
links. Given a subset N' of nodes of HT, the graph induced by N' is the graph
G' = (N', L'), where L' contains all the links of L whose endpoints are in N'.
From the above definitions G' contains neither self-loops nor multiple edges.

Fig. 2. Validity of hierarchy trees of a 4-vertex chain.

Definition 2. Given a hierarchy tree HT of a graph G, a covering of G on HT
is a subset C of nodes of HT such that $\forall v \in V \;\; \exists!\, c \in C : v \prec c$. A view of G
on HT is the graph induced by any covering of G on HT.

Trivial coverings consist of the root of HT and of the whole set of its leaves. 
Figure 1(a) and Figure 1(b) show a 12-vertex graph and a possible hierarchy
tree of it. The internal nodes of the hierarchy tree are squared and, for clarity,
no link is reported. A covering consisting of clusters {3,d,b,g,12,f} is highlighted
on the hierarchy tree and the corresponding view is given in Figure 1(c).
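
The construction of a view from a covering can be sketched in a few lines of Python. The data layout (an edge list for G and a dictionary mapping each cluster of the covering to the set of vertices it covers) and the function name are assumptions made for this illustration; the cluster labels in the usage note are hypothetical as well.

```python
def induced_view(edges, covering):
    """Return the links of the view of G induced by a covering.

    edges:    iterable of pairs (u, v), the edges of G
    covering: dict mapping each cluster to the set of vertices it covers;
              every vertex must be covered by exactly one cluster
    """
    owner = {}
    for cluster, vertices in covering.items():
        for v in vertices:
            if v in owner:
                raise ValueError("vertex covered by more than one cluster")
            owner[v] = cluster
    links = set()
    for u, v in edges:
        cu, cv = owner[u], owner[v]
        if cu != cv:  # no self-loops
            links.add(frozenset((cu, cv)))  # a set keeps a single link per pair
    return links
```

For a chain on vertices 1, 2, 3, 4, a covering made of a cluster that covers {1, 4} together with the single vertices 2 and 3 yields three links on three nodes, that is, a view containing a cycle.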

As this work is concerned with recursive clustering of trees and forests, in 
the rest of the paper we assume that the graph to be clustered is a free tree 
T = (V,E) with n = |V| vertices. Under this hypothesis it is natural to require
any view of T obtained from HT to be a tree. A hierarchy tree HT is said to be valid
iff it satisfies this property. It is worth observing that not every hierarchy tree
associated to a tree is valid. Figure 2(b) and Figure 2(c) show a simple example
of a valid hierarchy tree associated to a chain on four vertices and a view of
the chain on it, while Figure 2(d) depicts a non-valid hierarchy tree associated
to the same chain: the view associated to covering {a,2,3} contains a cycle, as
in Figure 2(e). All the existing links are shown on the hierarchy trees as dotted
lines. Due to the importance of the notion of validity for graph drawing purposes,
it is natural to restrict attention to algorithms that generate valid hierarchy
trees. Necessary and sufficient conditions for the validity of hierarchy trees
have been presented elsewhere; we reveal in advance that all the clustering algorithms
described in Section 3 and in Section 4 meet such requirements. For the purposes of this
paper it is enough to recall a weaker sufficient condition:

Theorem 1. Let T = (V, E) be a free tree and let HT = (N,A) be a
hierarchy tree of T. Then HT is valid if for each $c \in N$ s.t. S(c) is disconnected
there exists a vertex v, $v \not\prec c$, such that each connected component of S(c)
contains a neighbor of v in T.

Fig. 3. Attractive and unattractive hierarchy trees associated to an 11-vertex star
centered at vertex 0. The trees are grown by the algorithms presented in Section 3 and
in Section 4. Snapshots are produced in the algorithm animation system Leonardo.

In addition to validity, we can identify other interesting requirements on 
the structure of hierarchical decompositions that make them attractive. As an 
example, let us consider the hierarchy trees in Figure 3: all of them are associated
to the same tree (an 11-vertex star centered at vertex 0) and are valid. Let us
now suppose that the user of the visualization facility needs a view containing no
more than 5 vertices, one of which must be vertex 0. It is easy to see that such
a view can be obtained only from the hierarchy tree in Figure 3(c). The main
problem with the tree in Figure 3(a) is that it is really deep and unbalanced: any
covering of cardinality $\le 5$ can show at the maximum detail only vertices 2, 3,
4, and 5, i.e., the leaves in the highest levels of the hierarchy tree. Similarly, the
tree in Figure 3(b) is shallow and has too large a degree: it admits only one view
with at most 5 nodes, i.e., the view consisting of its root. This simple example
gives some insight into the fact that navigating balanced hierarchy trees with
limited degree provides the viewer with better possibilities of interaction with 
the visualization tool, since a high number of views can be obtained from such 
trees and preserving the mental map becomes easier. The following structural 
properties are therefore relevant for hierarchy trees that are going to be used in 
visualization applications: 

Limited degree: the expansion of a cluster should not result in the creation of a 
high number of new nodes in the view. Indeed, adding all of a sudden many 
new clusters to the view could imply drastic changes in its drawing. 

Small depth: as suggested in the literature, a logarithmic depth appears to be a
reasonable choice. Indeed, traversing the hierarchy tree should not require too
much time, but an excessively small depth (e.g., constant) is indicative of 
big differences between consecutive views. 

Balancing: nodes on the same level of the hierarchy tree should have similar
cardinalities. In this way, any navigation from the root to a leaf takes ap- 
proximately the same time, independently of the followed path. 

3 Divider-Based Clustering Algorithms 

In this section we first introduce the concept of t-divider and then we devise 
three clustering strategies which hinge upon this concept. From now on we call 
rank(T) the number of vertices in a tree T. 

Definition 3. Given a free tree T = (V, E) with n vertices and a constant $t \ge 2$,
a vertex $v \in V$ of degree d(v) is called t-divider if its removal disconnects T into
d(v) trees each one containing at most $\lfloor \frac{t-1}{t} n \rfloor$ vertices.

The concept of t-divider is a natural generalization of that of centroid
and separator, which are a 2-divider and a 3-divider, respectively. The
existence of centroids and the fact that, $\forall\, t_1 > t_2$, any $t_2$-divider is also a
$t_1$-divider, imply the existence of at least one t-divider for any value $t \ge 2$. We remark
that t-dividers are not necessarily unique. The following theorem holds:

Theorem 2. Let T = (V,E) be a free tree and let v be any vertex of V. For
any constant $t \ge 2$, if v is not a t-divider, the removal of v disconnects T into
k subtrees $T_1 \ldots T_k$ such that: (a) there exists a unique $T_i$ with more than
$\lfloor \frac{t-1}{t} n \rfloor$ vertices; (b) any t-divider of T belongs to $T_i$.

Based on Theorem 2, an efficient algorithm for finding a t-divider of a tree
starts from any vertex v, verifying if v is a t-divider. If so, it stops and returns
v. Otherwise, it iterates on the subtree of maximum rank $T_i$ obtained by the
removal of v; the new iteration starts from the vertex $w \in T_i$ adjacent to v.
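
The search just described can be sketched in Python as follows. The adjacency-dictionary representation, the function name and the bound floor((t-1)n/t) on the size of each remaining subtree are assumptions of this illustration (the package itself is written in ANSI C). Subtree sizes are computed once by rooting the tree at the start vertex, so the whole search takes linear time.

```python
import math

def find_t_divider(adj, t, start):
    """adj: dict vertex -> list of neighbours of a free tree; t >= 2."""
    n = len(adj)
    limit = math.floor((t - 1) * n / t)
    # Root the tree at `start`: parent pointers in breadth-first order.
    parent = {start: None}
    order = [start]
    for v in order:  # `order` grows while we iterate over it
        for w in adj[v]:
            if w != parent[v]:
                parent[w] = v
                order.append(w)
    # Subtree sizes, accumulated from the leaves upwards.
    size = {v: 1 for v in adj}
    for v in reversed(order):
        if parent[v] is not None:
            size[parent[v]] += size[v]
    v = start
    while True:
        children = [w for w in adj[v] if w != parent[v]]
        heavy = max(children, key=lambda w: size[w], default=None)
        # The component on the parent side has fewer than n - limit vertices,
        # which never exceeds limit when t >= 2, so only children are checked.
        if heavy is None or size[heavy] <= limit:
            return v
        v = heavy  # step into the unique oversized subtree
```

On a path with five vertices, starting from an endpoint, the sketch returns the middle vertex for t = 2 and the second vertex for t = 3, which illustrates that larger values of t accept more vertices as dividers.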

The concept of t-divider can be easily extended from trees to forests. We
say a forest $\mathcal{F}$ to be t-divided if all its trees have rank $\le \lfloor \frac{t-1}{t} n \rfloor$. Given a non
t-divided forest $\mathcal{F}$, we can t-divide it by removing a single vertex, called t-divider
for $\mathcal{F}$. It is easy to prove that this vertex must be searched in the maximum rank
tree of $\mathcal{F}$, say $T_i$, and that any t-divider of $T_i$ is also a t-divider for $\mathcal{F}$.

Let us now consider a generic step during the top-down construction of the
hierarchy tree HT and let c be the node of HT considered at that step. Let us
assume that S(c) is a tree named T. The following algorithm is a straightforward
application of the concept of t-divider:

Algorithm SimpleClustering (SC): After a t-divider d has been found, the
subtrees $T_1 \ldots T_k$ obtained from T by removing d are generated. We then create
k children of node c: these children are the subtrees $T_1 \ldots T_k$, where we add back
vertex d to the one having minimum rank. SC is then recursively applied to all
the generated clusters.

The height of the hierarchy tree returned by algorithm SimpleClustering is
O(log n) and an upper bound on the degree of its internal nodes is the degree of
T itself. Moreover, for any cluster c, the subgraph S(c) induced by the vertices
covered by c is connected, i.e., it is a tree. From now on, where there is no
ambiguity we refer to this property as connectivity of clusters. Though algorithm
SimpleClustering can be implemented using “off the shelf” data structures and
generates hierarchy trees with small depth, the structure of a hierarchy tree HT
grown by SC may be irregular and may depend too much on the input tree:
namely, the degree of the internal nodes of HT may be large, according to the
degree of the t-divider found by the algorithm at each step (see Figure 3(b)).

In the rest of this section we therefore present two extensions of this algorithm
which aim at overcoming these drawbacks. We start from the more realistic
hypothesis that the degree of the hierarchy tree should be limited by a constant
g > 2. Under this assumption, we exploit the idea that the regularity of the
structure of the hierarchy tree can be best preserved if clusters covering non-
connected subgraphs are allowed: in other words, if one is willing to give up
the property of connectivity, we expect that more balanced hierarchy trees can
be built. On the other hand, if clusters are allowed to be non-connected, special
attention must be paid to guarantee the validity of the hierarchy tree.

Finding connected clusters. The algorithm that we are going to present produces 
hierarchy trees with bounded degree g > 2 and guarantees the connectivity of 
clusters. Hence, it always returns valid hierarchy trees. 

Algorithm ConnectedClustering (CC): after finding the t-divider d and generating
the subtrees $T_1 \ldots T_k$, the algorithm checks the value k. If $k \le g$, it works
exactly as algorithm SC. If this is not the case, the subtrees are sorted in non-increasing
order according to their ranks: say $T_1 \ldots T_k$ the sorted sequence.
The children of node c will be $T_1, \ldots, T_{g-1}, T'$, where $T' = \{d\} \cup T_g \cup \ldots \cup T_k$.

It is easy to see that T' is connected thanks to the presence of the t-divider
d and that the degree of HT is at most g (it could be smaller than g due to the
case k < g). Unfortunately, nothing is guaranteed about the rank of T'. Hence,
HT may be very unbalanced, up to reaching linear height (see Figure 3(a)).

Finding balanced clusters. In the following we present an algorithm aimed at
producing hierarchy trees with a limited degree g > 2 and as balanced as
possible. To improve balancing we admit the existence of non-connected clusters.
Hence, from now on we assume that our clustering procedure works on a forest
$\mathcal{F}$ instead of a tree. We call $F_1 \ldots F_h$ the h trees in the forest. We also assume
that each tree $F_i$ has a representative vertex which satisfies the following
property: any representative vertex is connected to a vertex $v \notin \mathcal{F}$, v belonging
to the original tree to be clustered. Our clustering algorithm always produces clusters which
maintain this invariant property: then any hierarchy tree built by this algorithm
is valid according to Theorem 1.

Before introducing algorithm BalancedClustering, we need to briefly recall
a classical partitioning problem concerning the scheduling of a set of jobs, each 
having its own length, on p identical machines: the goal is to minimize the 
schedule makespan, i.e., the time necessary to complete the execution of the jobs. 
This problem has both an on-line and an off-line version and is NP-complete even
when p = 2. A simple approximation algorithm for it consists of considering
the jobs one by one and of assigning a job to the machine currently having the
smallest load. In the off-line case, very good solutions (i.e., an approximation
ratio equal to $\frac{4}{3}$) can be obtained if the jobs are previously sorted in non-increasing
order with respect to their lengths, so as to consider longer jobs first.
In the sequel we will refer to this algorithm as Scheduling.
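
A compact Python sketch of the off-line variant (longest jobs first, each job placed on the currently least loaded machine) is given below. The function name, the use of a binary heap and the returned data layout are choices made for this illustration only.

```python
import heapq

def lpt_schedule(lengths, p):
    """Greedy scheduling of jobs on p identical machines, longest job first.

    Returns (assignment, makespan), where assignment[m] lists the indices
    of the jobs placed on machine m.
    """
    heap = [(0, m) for m in range(p)]  # (current load, machine index)
    heapq.heapify(heap)
    assignment = [[] for _ in range(p)]
    by_length = sorted(range(len(lengths)), key=lambda j: lengths[j], reverse=True)
    for job in by_length:
        load, m = heapq.heappop(heap)  # machine with the smallest load
        assignment[m].append(job)
        heapq.heappush(heap, (load + lengths[job], m))
    makespan = max(load for load, _ in heap)
    return assignment, makespan
```

In the clustering setting the jobs would be the trees of the forest, the lengths their ranks, and p the degree bound g. For job lengths [7, 5, 4, 3, 3, 2] on two machines the sketch produces two machines with load 12 each.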

We now come back to our original clustering problem. Based on the concept
of t-divider of a forest, algorithm BalancedClustering works as follows: 

Algorithm BalancedClustering (BC): it first verifies if $\mathcal{F}$ is already t-divided.
If so, it calls procedure Scheduling, where p = g, the jobs coincide
with the trees in the forest, and the length of a job is the rank of the corresponding
tree. If $\mathcal{F}$ is not t-divided, after a t-divider for $\mathcal{F}$ has been found in
the maximum rank tree $F_i$, a set of subtrees of $F_i$, named $T_1 \ldots T_k$, is generated.
Without loss of generality, we assume that $T_1$ is the unique tree that contains
the representative vertex of $F_i$. We call $T_2 \ldots T_k$ lower trees of $\mathcal{F}$ and any other
subtree of the forest, $T_1$ included, upper tree. In order to maintain the validity of the
hierarchy tree that we are generating, the algorithm has to deal separately with
upper and lower trees in order not to mix them in the same cluster. Hence, it
builds $g_u$ clusters containing upper trees and $g_l = g - g_u$ clusters with only lower
trees. The cluster containing $T_1$ is the only cluster allowed to contain trees of
both types thanks to the fact that the t-divider, which belongs to $T_1$, connects
any lower tree to $T_1$ itself. The number $g_u$ is chosen proportionally to the total
rank of the upper trees. Once $g_u$ and $g_l$ have been chosen, algorithm BC makes
two consecutive calls to procedure Scheduling: the first call partitions the upper
trees into $g_u$ clusters, while the second one partitions the lower trees into $g_l$
clusters. The cluster containing $T_1$ may be also augmented during this phase, if
convenient.

Algorithm BalancedClustering is able to effectively balance the cardinalities
of clusters, as Figure 3(c) and the experiments presented in Section 6 suggest.
This good result is obtained in spite of losing the property of connectivity. However,
we remark that the disconnectivity of clusters remains “weak”, because the
distance on the original tree between any pair of representative vertices in the
same cluster is always equal to 2 (cf. Theorem 1).



4 Two Well-Known Clustering Algorithms 

In this section we briefly recall two well-known clustering algorithms that we 
implemented in our package after adapting them to work on general trees. 

The first algorithm, due to Gonzalez, is designed for clustering a set
of points into a fixed number k of clusters and has been in wide use during
the last twenty years. In the first step Gonzalez’s algorithm picks an arbitrary
point to be the representative of the first cluster. In each of the next k - 1
steps a new representative point is chosen through a max-min selection: the i-th
representative is a point which maximizes the minimum distance from the first
i - 1 representatives. Cluster S_i then consists of all points closer to the i-th
representative than to any other one. When specialized to trees, the farthest-
point algorithm (FP) considers as points the vertices of the tree and replaces the
Euclidean distance between two points with the length of the path between the
corresponding vertices on the tree. In our implementation the choice of the first
representative vertex is not random: instead, a vertex maximizing the average
distance from all the other vertices is chosen. It is easy to see that this vertex
is always a leaf. We observed in the experiments that this choice greatly
improves the behavior of the algorithm.
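
The farthest-point selection on a tree can be sketched as follows. This is a
minimal illustration, not the ANSI C implementation discussed in this paper: it
assumes unit edge lengths, an adjacency-dictionary input, k not larger than the
number of vertices, and ties broken by iteration order. The function names are
hypothetical.

```python
from collections import deque


def bfs_distances(adj, source):
    """Path lengths (number of edges) from source to every vertex of the tree."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def farthest_point_clustering(adj, k):
    """Return (representatives, clusters) for a tree given as {vertex: [neighbors]}."""
    # All-pairs distances by one BFS per vertex: quadratic, fine for a sketch.
    all_dist = {v: bfs_distances(adj, v) for v in adj}

    # First representative: a vertex maximizing the total (hence the average)
    # distance from all the other vertices.
    first = max(adj, key=lambda v: sum(all_dist[v].values()))
    reps = [first]
    nearest = dict(all_dist[first])  # distance of each vertex to its closest rep

    # Max-min selection of the remaining k - 1 representatives.
    while len(reps) < k:
        nxt = max(adj, key=lambda v: nearest[v])
        reps.append(nxt)
        for v in adj:
            nearest[v] = min(nearest[v], all_dist[nxt][v])

    # Each vertex joins the cluster of its closest representative.
    clusters = {r: [] for r in reps}
    for v in adj:
        closest = min(reps, key=lambda r: all_dist[r][v])
        clusters[closest].append(v)
    return reps, clusters
```

On a path with five vertices and k = 2, the two endpoints become the
representatives, and the middle vertex, which is equidistant from both, goes to
whichever representative the tie-breaking rule prefers.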

The second algorithm (FR) is due to Frederickson and generates connected
clusters with limited cardinalities, thus producing quite balanced hierarchy
trees. Even if it was designed for clustering cubic trees, it can be easily
generalized to trees with bounded degree. Vertices are partitioned on the basis
of the topology of the tree with a bottom-up strategy that grows the clusters
recursively, starting from the leaves. A cluster is built when its current
cardinality falls into a prefixed range [z, D(z - 1) + 1], where D is the tree
degree and z is a value to be suitably chosen. In our implementation z is selected
in {1..n} in order to limit the degree g of HT as much as possible: if all the
values for z produce more than g clusters, we choose the value that minimizes the
number of created clusters. Since algorithm FR is not always able to satisfy the
requirements on g, the degree of the hierarchy trees it returns is not bounded.

5 Experimental Setup 

We implemented the tree clustering algorithms described in Section 3 and in
Section 4 in ANSI C, and we debugged and visualized them in the algorithm
animation system Leonardo. To demonstrate the feasibility of the clustering
approach for the visualization of large trees, we also built a prototype of a
visualization facility which hinges upon this approach, letting the user navigate
through the hierarchy tree by expanding and shrinking clusters. The prototype,
which includes the algorithm implementations, the visualization code, and the
tree generator, is available at the URL
http://www.dsi.uniroma1.it/~finocchi/experim/treeclust-2.0/
In the rest of this section we review some interesting aspects of our experimental
framework related to instance generation and performance indicators.

The random tree generator. The problem instances that we used in the tests
are synthetic and have been randomly generated. Since both the degree and
the balancing of the input tree appeared to us to be crucial factors for the
performance of the t-divider based algorithms, we designed a tree generator
which produces structured instances taking these parameters into account. The
generator works recursively and takes as input four arguments, named n, d, D,
and $\beta$: n is the number of vertices of the tree T to be built; d and D are a
lower and an upper bound for its degree, respectively; $\beta$ is the unbalancing
factor of T, i.e., a real number in [0, 1] which indicates how unbalanced T must
be: the bigger $\beta$ is, the more unbalanced T will be. Let us suppose that during
a generic step the generator has to build a tree T, rooted at a vertex v, with
parameters n, d, D, and $\beta$. If $n - 1 \le d$, then all the n - 1 vertices are made
children of v: hence, T can have vertices with degree smaller than d, though all
the children of such vertices are always leaves. If $n - 1 > d$, the number r of
children of v is randomly chosen by picking a value in $[d, \min\{n - 1, D\}]$, and
each of the n - 1 vertices of T (except for v) is randomly assigned to one of the
r subtrees. We call $n_i$ the number of vertices assigned to the subtree $T_i$
rooted at the i-th child. If no $n_i$ is $\ge \lfloor \beta \cdot n \rfloor$, a further
step to make T more unbalanced is performed. If, w.l.o.g.,
$n_1 = \max_{1 \le i \le r} n_i$, exactly $\lfloor \beta \cdot n \rfloor - n_1$ vertices
are randomly removed from the subtrees they belong to and are added to $T_1$.
This construction satisfies the unbalancing factor $\beta$ required for vertex v.
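
The recursive construction just described can be sketched as below. This is an
illustrative reading of the description, not the generator distributed with the
prototype. It rests on a few assumptions that the text does not state: d is at
least 1, every subtree receives at least one vertex (its root), the target size
of the largest subtree is capped so that no other subtree becomes empty, and
vertices are moved by picking a donor subtree uniformly at random. The function
name and the parent-array output are hypothetical.

```python
import random


def random_tree(n, d, D, beta, rng=None):
    """Return a parent array (root is vertex 0, parent[0] == -1) of an n-vertex tree."""
    rng = rng or random.Random()
    parent = [-1] * n
    next_id = [1]  # next unused vertex identifier

    def build(root, size):
        rest = size - 1  # vertices still to be placed below root
        if rest == 0:
            return
        if rest <= d:
            # Too few vertices to reach degree d: all become leaf children.
            for _ in range(rest):
                parent[next_id[0]] = root
                next_id[0] += 1
            return
        r = rng.randint(d, min(rest, D))  # number of children of root
        sizes = [1] * r                   # assumption: no empty subtree
        for _ in range(rest - r):
            sizes[rng.randrange(r)] += 1
        # Unbalancing step: grow the largest subtree up to floor(beta * size),
        # capped so that every other subtree keeps at least one vertex.
        target = min(int(beta * size), rest - (r - 1))
        big = max(range(r), key=sizes.__getitem__)
        while sizes[big] < target:
            donors = [i for i in range(r) if i != big and sizes[i] > 1]
            sizes[rng.choice(donors)] -= 1
            sizes[big] += 1
        for s in sizes:
            child = next_id[0]
            next_id[0] += 1
            parent[child] = root
            build(child, s)

    build(0, n)
    return parent
```

Because the recursion follows the largest subtree, very unbalanced instances
produce deep trees; for large n and beta close to 1 an iterative formulation
would avoid Python's recursion limit.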

Performance indicators. We considered three kinds of quality measures, aimed
at highlighting different aspects of the structure of the hierarchy trees grown
by the algorithms. In particular, based on the considerations in Section 2, we
studied balancing, depth, and degree properties.

Measure Unbalancing estimates how balanced a hierarchy tree is and
is computed according to the following idea: the children of a cluster c in HT
should have “not too different” cardinalities. Let n be the cardinality of c, let
d(c) be its degree, and let $n_i$ be the cardinality of the i-th child of c. In the
most favourable scenario, each child of c covers $n/d(c)$ vertices. A good estimate
of the unbalancing of cluster c is obtained by counting how much the real
cardinalities of its children differ from the optimal child cardinality $n/d(c)$.
The quantity $\sum_{i=1}^{d(c)} |n_i - n/d(c)|$ is then normalized with respect to n,
so as not to count bigger contributions for clusters covering more vertices.
Measure Unbalancing is the sum over all the clusters in the hierarchy tree of the
values defined above, averaged over the total number of clusters of HT which are
not leaves. The lower the value of this measure, the more balanced the hierarchy
tree. We remark that we designed measure Unbalancing so that it can correctly
compare even hierarchy trees having internal nodes with different degrees.

A second performance indicator relates to the depth of the hierarchy tree and 
is referred to as External Path Length. It represents the sum of the depths of 
the leaves of the hierarchy tree. In this case it is not necessary to average this 
measure on the number of leaves, which is always the same for all the hierarchy 
trees associated to the same tree instance. Finally, measures Average Degree 
and Maximum Degree refer to the average and to the maximum degree of the 
hierarchy tree, respectively. 
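
As an illustration, the two measures can be computed as in the sketch below. It
assumes a hierarchy tree encoded as nested Python lists, where a list is a
cluster and any non-list value is a leaf (a vertex of the original tree), and it
assumes that the deviation from the optimal child cardinality is measured in
absolute value. Both the encoding and the function names are hypothetical.

```python
def cardinality(node):
    """Number of original vertices covered by a cluster (1 for a leaf)."""
    if not isinstance(node, list):
        return 1
    return sum(cardinality(child) for child in node)


def unbalancing(root):
    """Average, over non-leaf clusters, of the normalized child-size deviation."""
    total, clusters = 0.0, 0
    stack = [root]
    while stack:
        node = stack.pop()
        if not isinstance(node, list):
            continue
        n = cardinality(node)
        ideal = n / len(node)  # optimal child cardinality n / d(c)
        total += sum(abs(cardinality(child) - ideal) for child in node) / n
        clusters += 1
        stack.extend(node)
    return total / clusters if clusters else 0.0


def external_path_length(node, depth=0):
    """Sum of the depths of the leaves of the hierarchy tree."""
    if not isinstance(node, list):
        return depth
    return sum(external_path_length(child, depth + 1) for child in node)
```

For instance, the hierarchy `[[0, 1], [2, 3]]` is perfectly balanced, while
`[[0, 1, 2], 3]` covers the same four vertices with a lower external path length
but a positive unbalancing value.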

6 Experimental Results 

In this section we empirically study the behavior of the previously described tree
clustering algorithms, using different values of t in the analysis of the t-divider
based ones. The main goals of our experiments are the following: (a) comparing
the algorithms, studying whether and how they are affected by the degree and
balancing of the input tree, and giving insight into the choice of a “good”
algorithm for a given value of g and a given problem instance; (b) validating or
disproving the practical utility of the theoretical concept of t-divider;
(c) checking whether relaxing the connectivity of clusters and using Graham’s
heuristic effectively help to improve the balancing of the hierarchy tree.

The experiments comprise three kinds of tests, in which either the number n of
vertices, or the unbalancing factor $\beta$, or the maximum degree D of the
input tree varies. All the experiments used from 50 to 100 trials per data point,
and the sequence of seeds used in each test was randomly generated starting from
a base seed. The algorithms based on t-dividers are referred to in the charts by
means of their name (SC, CC, or BC) followed by the value t that they employ.
We sometimes report the results of the tests for different values of g: though
the general trend of the curves is maintained, the relative performance of the
algorithms often varies. We remark preliminarily that the values of the metrics
are usually smaller for bigger values of g: roughly speaking, we can explain this
by observing that, when g is bigger, the algorithms usually have more
possibilities for decomposing the original tree and so can do it better. We now
give more insight into the tests we performed.

Fig. 4. Utility of t-dividers.



Experiments on t-dividers. Figure 4 presents the results of the comparison of
algorithms BC and CC for different values of t, i.e., $2 \le t \le 6$. The graph
on the left part of Figure 4 shows that algorithm BC has an almost constant
behavior as the instance size increases and that the choice of t greatly affects
its performance. In particular, the best results are obtained for t = 3, thus
showing that the use of centroids (t = 2), which is frequent in the literature,
does not necessarily yield the best solutions. This is even more evident in the
graph on the right part of Figure 4, where the interrelation between the balancing
of the input tree and the best choice for t is analyzed w.r.t. algorithm CC: t = 2
represents the worst choice for quite balanced trees, i.e., for $\beta < 0.6$. It is
worth pointing out that all the curves have a local maximum at a threshold value
of $\beta$. This trend depends on the choice of the t-divider vertex: for a fixed t,
as long as $\beta$ stays below the threshold, the same vertex continues to be chosen
as t-divider, while the tree becomes more and more unbalanced. As soon as $\beta$
exceeds the threshold, the vertex chosen as t-divider changes, implying a more
balanced partition of the vertices. Similar results have been obtained for
different parameters of the experiments and for algorithm SC as well. Hence, in
the tests presented later on, for each algorithm we report only the chart
corresponding to the best choice of t.

Algorithms’ comparison. The first test that we discuss analyzes the behavior of
the algorithms when the maximum degree of the input tree increases. The charts
in Figure 5 have been obtained by running the algorithms on 300-vertex trees
with unbalancing factor 0.5, minimum degree 2, and maximum degree ranging
from 3 to 30 (this implies that the average degree increases, too). Each row
reports the graphs obtained by running the algorithms with a different value g
of the maximum degree required for HT, $2 \le g \le 4$. A first reading of the
charts is sufficient to grasp the trend of the curves and to separate the
algorithms into two clearly distinct groups: algorithm SC and algorithm BC are
able to benefit from the increase of the degree, while algorithms CC, FP, and FR
get worse as D increases. This behavior, which holds w.r.t. both Unbalancing and
External Path Length, strengthens the following intuition: if D is large, the
t-dividers found by the algorithms at each step will probably have big degrees
and so will disconnect the tree into numerous clusters of small cardinalities.
This immediately clarifies why SC gets better. As for the balanced approach, it
is then easy to recombine numerous small clusters to form exactly g clusters,
while the connected strategy is not well suited for this: due to the necessity of
maintaining the

Fig. 5. Experiments for increasing maximum degree of the input tree: charts on each
row are for g = 2, 3, 4, respectively. (Horizontal axis of each chart: D = max
degree of the input tree.)



connectivity of clusters, it will usually obtain an unbalanced partition consisting
of g - 1 small clusters and a single big cluster. The worsening behaviour of FP
and FR can be explained similarly to that of algorithm CC. However, we remark
that algorithm FR obtains very good results w.r.t. measure Unbalancing,
especially for larger values of g, despite the increasing trend of its curve.

It is also worth observing that, as we increase the value of g keeping the
other parameters fixed, the curves get closer and their relative position changes.
Let us focus, for instance, on the unbalancing measure concerning algorithms
BC and FP: the related curves do not cross for g = 2, a crossing is introduced
at D = 5 when g = 3, and such a crossing point moves forward as g gets bigger
(D = 12 for g = 4). We underline that before the crossing point algorithm FP
exhibits a better behavior than algorithm BC. This suggests that the choice
of the best algorithm to use on a given instance is much influenced by a subtle

Fig. 6. Algorithms’ comparison w.r.t. average and maximum degree of the hierarchy
tree. In both the experiments g = 2. (Horizontal axis of each chart: D = max
degree of the input tree.)



interplay between the maximum degree g allowed for HT and the average degree
of the input tree. Similar experiments should be reported for different choices
of parameter $\beta$ in order to be able to recognize, given a problem instance and
a value for g, whether FP is preferable to BC. Due to the lack of space we do not
report such graphs in this paper.

Even if algorithms FR and SC behave very well w.r.t. measures Unbalancing
and External Path Length, as Figure 5 shows, we have to notice that neither of
them limits the degree of the hierarchy tree, which often makes them impractical
in graph drawing applications. Figure 6 highlights that both the average and the
maximum degree of the hierarchy trees returned by these algorithms are much
bigger than the value required for g.

We also studied how the relative performance of the algorithms is affected
by the instance size and by the balancing of the input tree (see Figure 7). In the
left chart we considered 300-vertex trees with degree in [2, 3] and we increased
the value of the unbalancing factor $\beta$ from 0.2 to 0.95, with step 0.05. We
found that algorithm SC is much influenced by the growth of $\beta$, suggesting that
unbalanced trees are more difficult to partition than balanced ones. In the right
chart we therefore considered the same range for the degree and we fixed $\beta$ to
0.8; we varied the number of vertices in the range 50 to 500. With respect to
measure Unbalancing no algorithm is affected by the increase of n, i.e., the
measure is almost constant. This depends on the fact that, for a fixed $\beta$,
smaller trees returned by our generator locally reflect the global structure of
larger ones, leading the algorithms to a similar behavior. The relative positions
of the unbalancing curves are not very surprising. As far as the algorithms are
concerned, it is easy to see that neither algorithm CC nor algorithm FP takes
balancing into account at all, possibly obtaining very unbalanced trees. The
simple clustering strategy is also affected by the balancing of the input
instances, and this carries over to the hierarchy trees. As expected from
theoretical considerations, algorithms BC and FR turn out to be the best ones
w.r.t. the choice of parameters in this test.

Fig. 7. Algorithms’ comparison for increasing unbalancing and increasing number of
vertices of the input tree.

7 Conclusions

In this paper we have focused on the problem of experimentally evaluating, with
respect to criteria relevant in graph drawing applications, the quality of
hierarchical decompositions of trees built using different clustering
subroutines. We have
designed new tree clustering algorithms hinging upon the notion of t-divider and
we have empirically shown the relevance of this concept as a generalization of
the ideas of centroid and separator. In particular, our experiments show that
the use of centroids, which is very frequent in the literature, does not usually
yield the best solution in practice. From the comparison of the algorithms’
behavior, it turns out that the choice of the best algorithm to use on a given
tree instance depends on structural properties of the instance (e.g., balancing
and degree) and on the quality measure to be optimized on the hierarchy tree:
balancing, depth, or degree. Algorithms FR and SC turn out to behave very well
w.r.t. balancing and depth, respectively, but they do not limit the degree of the
hierarchy tree, which often makes them impractical for graph drawing purposes.
Algorithm BC obtains fairly good results w.r.t. all the performance indicators we
considered.

We are currently working on extending the concepts and ideas presented
throughout the paper to edge-weighted trees; this is a better model of real
situations, since the edge weights represent the level of correlation between
vertices. From an experimental point of view, we are also studying the quality of
hierarchical decompositions associated with trees obtained from real-life
applications. The interested reader can download our visualization prototype at
the URL http://www.dsi.uniroma1.it/~finocchi/experim/treeclust-2.0/.

Abstract. Speed-up techniques that exploit given node coordinates have
proven useful for shortest-path computations in transportation networks
and geographic information systems. To facilitate the use of such techniques
when coordinates are missing from some, or even all, of the nodes
in a network we generate artificial coordinates using methods from graph
drawing. Experiments on a large set of German train timetables indicate
that the speed-up achieved with coordinates from our network drawings
is close to that achieved with the actual coordinates.



1 Introduction 

In travel-planning systems, shortest-path computations are essential for answer- 
ing connection queries. While still computing the optimal paths, heuristic speed- 
up techniques tailored to geographic networks have been shown to reduce re- 
sponse times considerably [20, 19] and are, in fact, used in many such systems. 

The problem we consider has been posed by an industrial partner¹ who is a
leading provider of travel planning services for public transportation. They are 
faced with the fact that quite often, much of the underlying geography, i.e. the 
location of nodes in a network, is unknown, since not all transport authorities 
provide this information to travel service providers or competitors. Coordinate 
information is costly to obtain and to maintain, but since the reduction in query 
response time is important, other ways to make the successful geometric speed- 
up heuristics applicable are sought. 

The existing, yet unknown, underlying geography is reflected in part by travel 
times, which in turn are given in the form of timetables. Therefore, we can con- 
struct a simple undirected weighted graph in the following way. Each station 
represents a vertex, and two vertices are adjacent, if there is a non-stop connec- 
tion between the two corresponding stations. Edge weights are determined from 
travel times, thus representing our distance estimates. Reasonable (relative) lo- 
cation estimates are then obtained by embedding this graph in the plane such 
that edge lengths are approximately preserved. 

* Research partially supported by the Deutsche Forschungsgemeinschaft (DFG) under 
grant WA 654/10-4. 

¹ HaCon Ingenieurgesellschaft mbH, Hannover.

Our specific scenario and geometric speed-up heuristics for shortest-path
computations are reviewed briefly in Sect. 2. In Sect. 3, we consider the
special case in which the locations of a few stations are known and show that a
simple and efficient graph drawing technique yields excellent substitutes for the
real coordinates. This approach is refined in Sect. 4 to be applicable in more
general situations. In Sect. 5, both approaches are experimentally evaluated on
timetables from the German public train network using a snapshot of half a
million connection queries.

2 Preliminaries 

Travel information systems for, e.g., car navigation [21,12] or public trans- 
port [17,18,22], often make use of geometric speed-up techniques for shortest- 
path computations. We consider the (simplified) scenario of a travel information 
system for public railroad transport used in a recent pilot study [19]. It is based 
solely on timetables; for each train there is one table, which contains the depar- 
ture and arrival times of that train at each of its halts. In particular, we assume 
that every train operates daily. 

The system evaluates connection queries of the following kind: Given a de- 
parture station A, a destination station B, and an earliest departure time, find 
a connection from A to B with the minimum travel time (i.e., the difference 
between the arrival time at B and the departure time at A). 

To this end, a (directed) timetable graph is constructed from timetables in a 
preprocessing step. For each departure and arrival of a train there is one vertex 
in the graph. So, each vertex is naturally associated with a station, and with a 
time label (the time the departure or arrival of the train takes place). There are 
two different kinds of edges in the graph: 

— stay edges: The vertices associated with the same station are ordered accord- 
ing to their time label. Then, there is a directed edge from every vertex to 
its successor (for the last vertex there is an edge to the first vertex) . Each of 
these edges represents a stay at the station, and the edge length is defined 
by the duration of that stay. 

— travel edges: For every departure of a train there is a directed edge to the 
very next arrival of that train. Here, the edge length is defined to be the 
time difference between arrival and departure. 

Answering a connection query now amounts to finding a shortest path from 
a source to one out of several target vertices: The source vertex is the first vertex 
at the start station representing a departure that takes place not earlier than the 
earliest departure time, and each vertex at the destination station is a feasible 
target vertex. 
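
The construction of the timetable graph and the evaluation of a connection query
can be sketched as follows. This is a minimal illustration under stated
assumptions, not the system studied in [19]: times are minutes after midnight,
each train is a list of halts `(station, arrival, departure)` with `None` for a
missing time, events at the same time are ordered with arrivals first, and a
query later than the last departure of the day wraps to the first departure. All
function names are hypothetical.

```python
import heapq
from collections import defaultdict

DAY = 24 * 60  # every train is assumed to operate daily


def build_timetable_graph(trains):
    vertices = []                   # vertex id -> (station, time, kind)
    adj = defaultdict(list)         # vertex id -> [(successor id, edge length)]
    by_station = defaultdict(list)  # station -> vertex ids

    def add_vertex(station, time, kind):
        vertices.append((station, time, kind))
        by_station[station].append(len(vertices) - 1)
        return len(vertices) - 1

    for halts in trains:
        last_dep = None
        for station, arrival, departure in halts:
            if arrival is not None:
                a = add_vertex(station, arrival, "arr")
                if last_dep is not None:
                    # travel edge: departure -> very next arrival of the train
                    adj[last_dep].append((a, (arrival - vertices[last_dep][1]) % DAY))
            last_dep = None
            if departure is not None:
                last_dep = add_vertex(station, departure, "dep")

    for ids in by_station.values():
        ids.sort(key=lambda v: (vertices[v][1], vertices[v][2] == "dep"))
        # stay edges: each vertex to its successor, the last one to the first
        for v, w in zip(ids, ids[1:] + ids[:1]):
            if v != w:
                adj[v].append((w, (vertices[w][1] - vertices[v][1]) % DAY))
    return vertices, adj, by_station


def earliest_connection(vertices, adj, by_station, start, dest, earliest):
    """Travel time in minutes from the source vertex, or None if unreachable."""
    deps = [v for v in by_station[start] if vertices[v][2] == "dep"]
    if not deps:
        return None
    later = [v for v in deps if vertices[v][1] >= earliest]
    source = min(later or deps, key=lambda v: vertices[v][1])
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, v = heapq.heappop(heap)
        if d > dist[v]:
            continue
        if vertices[v][0] == dest:  # any vertex at the destination is a target
            return d
        for w, length in adj[v]:
            if d + length < dist.get(w, float("inf")):
                dist[w] = d + length
                heapq.heappush(heap, (d + length, w))
    return None
```

The search is plain Dijkstra with early termination at the first settled vertex
of the destination station; the speed-up techniques of the next subsection
modify exactly this loop.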

2.1 Geometric Speed-Up Techniques 

In [19], Dijkstra’s algorithm is used as a basis for these shortest-path
computations and several speed-up techniques are investigated. We focus on the
purely geometric ones, i.e. those based directly on the coordinates of the
stations, which can be combined with other techniques.



Goal-directed search. This strategy is found in many textbooks (e.g., see [1,
16]). For every vertex v, a lower bound b(v) satisfying a certain consistency
criterion is required for the length of a shortest path to the target. In a
timetable graph, a suitable lower bound can be obtained by dividing the Euclidean
distance to the target by the maximum speed of the fastest train. Using these
lower bounds, the length $\lambda_{(u,v)}$ of each edge (u, v) is modified to

$$\lambda'_{(u,v)} = \lambda_{(u,v)} - b(u) + b(v).$$

It can be shown that a shortest path in the original graph is a shortest path in
the graph with the modified edge lengths, and vice versa. If Dijkstra’s algorithm
is applied to the modified graph, the search will be directed towards a correct
target.
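
A small sketch may help to see the effect of the modified edge lengths. It is an
illustration only: it assumes a generic graph with known vertex positions, a
single target vertex, and edge lengths that are never smaller than the Euclidean
distance between the endpoints divided by `vmax` (which makes the bounds
consistent, so that all modified lengths are non-negative). The function names
and the `settled` counter are hypothetical additions used to compare the two
searches.

```python
import heapq
import math


def dijkstra(adj, source, target, length):
    """Return (distance to target, number of settled vertices)."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    settled = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in settled:
            continue
        settled.add(u)
        if u == target:
            return d, len(settled)
        for v, lam in adj.get(u, ()):
            nd = d + length(u, v, lam)
            if nd < dist.get(v, math.inf):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return math.inf, len(settled)


def plain_search(adj, source, target):
    return dijkstra(adj, source, target, lambda u, v, lam: lam)


def goal_directed_search(adj, pos, source, target, vmax):
    def b(v):
        # lower bound: Euclidean distance to the target over the maximum speed
        return math.dist(pos[v], pos[target]) / vmax

    d, settled = dijkstra(adj, source, target,
                          lambda u, v, lam: lam - b(u) + b(v))
    # Along any path the bounds telescope, so the original length is recovered
    # by adding b(source) - b(target) to the modified length.
    return d + b(source) - b(target), settled
```

On a line of stations with the source in the middle, the plain search explores
both directions, whereas the goal-directed search assigns length zero to edges
pointing straight at the target and settles only the vertices between source and
target.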



Angle restriction. In contrast to the goal-directed search, this technique re- 
quires a preprocessing step, which has to be carried out once for the timetable 
graph and is independent of the subsequent queries. For every vertex v repre- 
senting the departure of a train, a circle sector C(v) with origin at the location 
of the vertex is computed. That circle sector is stored using its two bounding 
angles, and has the following interpretation: If a station A is not inside the circle 
sector C(v), then there is a shortest path from v to A, which starts with the 
outgoing stay edge. 

Hence, when Dijkstra’s algorithm is applied to compute a shortest path to some
destination station D and some vertex u is processed for which D is not inside
the circle sector C(u), the outgoing travel edge can be ignored, because there
is a shortest path from u to D starting with the stay edge.
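
The membership test behind this pruning rule can be sketched as follows. The
representation is an assumption: a sector is taken to be the set of directions
swept counter-clockwise from the angle `lo` to the angle `hi` (in radians), and
the function name is hypothetical. How the sectors themselves are computed in
the preprocessing step is not covered by this sketch.

```python
import math

TWO_PI = 2.0 * math.pi


def in_sector(origin, lo, hi, point):
    """True if point lies in the sector at origin swept counter-clockwise lo -> hi."""
    angle = math.atan2(point[1] - origin[1], point[0] - origin[0]) % TWO_PI
    lo %= TWO_PI
    hi %= TWO_PI
    if lo <= hi:
        return lo <= angle <= hi
    # The sector crosses the positive x-axis (e.g. from 315 to 45 degrees).
    return angle >= lo or angle <= hi
```

During the search, the outgoing travel edge of a departure vertex u would be
relaxed only if `in_sector` holds for the location of the destination station
and the stored bounding angles of C(u).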



2.2 Estimating Distances from Travel Times 

The location of stations is needed to determine lower bounds for goal-directed 
search, or circle sectors for the angle-restriction heuristic. If the actual geo- 
graphic locations are not provided, the only related information available from 
the timetables are travel times. We use them to estimate distances between 
stations that have a non-stop connection, which in turn are used to generate 
locations suitable for the geometric heuristics, though in general far from being 
geographically accurate. 

The (undirected, simple) station graph of a set of timetables contains a vertex
for each station listed, and an edge between every pair of stations connected by
a train not stopping in between. The length $\lambda_e$ of an edge e in the station
graph will represent our estimate of the distance between its endpoints.

The distance between two stations can be expected to be roughly a linear function
of the travel time. However, for different classes of trains the constant involved
will be different, and closely related to the mean velocity of trains in this
class. We therefore estimate the length of an edge e in the station graph, i.e.
the distance between two stations, as the mean, over all non-stop connections
inducing this edge, of their travel time multiplied by the mean velocity of the
vehicle serving them.
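
In code, this estimate amounts to a grouped average. The sketch below assumes
that the non-stop connections are available as tuples `(station_u, station_v,
travel_time, train_class)` and that the mean velocity of every train class is
given in a dictionary; both the input format and the function name are
hypothetical.

```python
from collections import defaultdict


def estimate_edge_lengths(connections, mean_velocity):
    """Map each undirected station pair to its estimated distance."""
    samples = defaultdict(list)
    for u, v, travel_time, train_class in connections:
        # one distance sample per non-stop connection inducing the edge {u, v}
        samples[frozenset((u, v))].append(travel_time * mean_velocity[train_class])
    return {edge: sum(xs) / len(xs) for edge, xs in samples.items()}
```

For example, a 30-minute high-speed connection at a mean velocity of 2 km/min
and a 40-minute local connection at 1 km/min between the same two stations yield
samples of 60 km and 40 km, hence an estimated edge length of 50 km.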

Mean velocities have been extracted from the data set described in Sect. 5, 
for which station coordinates are known. For two train categories, the data are 
depicted in Fig. 1, indicating that no other function is obviously better than our 
simple linear approximation. Note that all travel times are integer, since they are 
computed from arrival and departure times. As a consequence, slow trains are 
often estimated to have unrealistically high maximum velocities, thus affecting 
the modified edge lengths in the goal-directed search heuristic. 





Fig. 1. Euclidean distance vs. travel time for non-stop connections: (a) high-speed
trains, (b) local trains. For both service categories, all data points are shown
along with the average distance per travel time and a linear interpolation.
(Horizontal axis: travel time [min].)



3 Networks with Partially Known Geography 

In our particular application, it may occasionally be the case that the geographic 
locations of at least some of the major hubs of the network are known, or can be 
obtained easily. We therefore first describe a simple method to generate coordi- 
nates for the other stations that exploits the fact that such hubs are typically 
well-distributed and thus form a scaffold for the overall network. Our approach 
for the more general case, described in the next section, can be viewed as an 
extension of this method. 

Let $p = (p_v)_{v \in V}$ be a vector of vertex positions; then the potential function

$$U_B(p) = \sum_{\{u,v\} \in E} \omega_{\{u,v\}} \cdot \|p_u - p_v\|^2 \qquad (1)$$

where $\omega_e$, $e \in E$, weights the influence of an edge according to its
estimated length $\lambda_e$, defines a weighted barycentric layout model [25]. This
model has an interesting physical analogy, since each of the terms in (1) can be
interpreted as the potential energy of a spring with spring constant $\omega_e$ and
ideal length zero.

A necessary condition for a local minimum of $U_B(p)$ is that all partial
derivatives vanish. That is, for all $p_v = (x_v, y_v)$, $v \in V$, we have

$$x_v = \frac{1}{\sum_{u : \{u,v\} \in E} \omega_{\{u,v\}}} \sum_{u : \{u,v\} \in E} \omega_{\{u,v\}} \cdot x_u,
\qquad
y_v = \frac{1}{\sum_{u : \{u,v\} \in E} \omega_{\{u,v\}}} \sum_{u : \{u,v\} \in E} \omega_{\{u,v\}} \cdot y_u.$$

In other words, each vertex must be positioned in the weighted barycenter of 
its neighbors. It is well known that this system of linear equations has a unique 
solution, if at least one pv in each connected component of G is given (and 
the equations induced by v are omitted) [3]. Note that, in the physical analogy, 
this corresponds to fixing some of the points in the spring system. Moreover, the 
matrix corresponding to this system of equations is weakly diagonally dominant, 
so that iterative equation solvers can be used to approximate a solution quickly 
(see, e.g., [10]). 

Assuming that the given set of vertex positions provides the cornerstones 
necessary to unfold the network appropriately, we can thus generate coordinates 
for the other vertices using, e.g., Gauss-Seidel iteration, i.e. by iteratively placing 
them at the weighted barycenter of their neighbors. Figure 2 indicates that this 
approach is highly dependent on the set of given positions. As is discussed in 
Sect. 5, it nevertheless has some practical merits. 
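
The Gauss-Seidel iteration mentioned above can be sketched in a few lines. The
sketch assumes that the graph is given as a dictionary from vertex pairs to
spring constants, that every connected component contains at least one fixed
vertex, and that a fixed number of sweeps is sufficient; the function name and
the default number of sweeps are hypothetical.

```python
from collections import defaultdict


def barycentric_layout(edges, fixed, sweeps=200):
    """edges: {(u, v): weight}; fixed: {vertex: (x, y)}. Returns {vertex: (x, y)}."""
    nbrs = defaultdict(list)
    for (u, v), w in edges.items():
        nbrs[u].append((v, w))
        nbrs[v].append((u, w))
    # Free vertices start at the origin; fixed vertices keep their positions.
    pos = {v: fixed.get(v, (0.0, 0.0)) for v in nbrs}
    free = [v for v in nbrs if v not in fixed]
    for _ in range(sweeps):
        for v in free:
            total = sum(w for _, w in nbrs[v])
            # Place v at the weighted barycenter of its neighbors, using the
            # most recent positions (this is what makes it Gauss-Seidel).
            x = sum(w * pos[u][0] for u, w in nbrs[v]) / total
            y = sum(w * pos[u][1] for u, w in nbrs[v]) / total
            pos[v] = (x, y)
    return pos
```

With a single free vertex m between fixed vertices a = (0, 0) and b = (4, 0) and
spring constants 3 and 1, m settles at (1, 0), closer to the neighbor with the
stiffer spring.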




Fig. 2. Barycentric layout of an 72 x 72 grid with the four corners fixed, and the same 
grid with 95 and with 10 randomly selected vertices fixed 



4 A Specific Layout Model for Connection Networks 

The main drawbacks of the barycentric approach are that all vertices are posi- 
tioned inside of the convex hull of the vertices with given positions, and that the 
estimated distances are not preserved. In this section, we modify the potential (1) 
to take these estimates into account. 

Recall that each of the terms in the barycentric model corresponds to the
potential energy of a spring of length zero between pairs of adjacent vertices.
Kamada and Kawai [13] use springs of length $\lambda_{\{u,v\}} = d_G(u, v)$, i.e. equal
to the length of a shortest path between u and v, between every pair of vertices.
The potential then becomes

$$U_{KK}(p) = \sum_{u,v \in V} \omega_{\{u,v\}} \cdot \left( \|p_u - p_v\| - \lambda_{\{u,v\}} \right)^2, \qquad (2)$$

the idea being that constituent edges of a shortest path in the graph should form
a straight line in the drawing of the graph. To preserve local structure, the
spring constants $\omega_e$ are chosen to decrease with the spring length
$\lambda_e$, so that long springs are more flexible than short ones. (The longer a
path in the graph, the less likely we are to be able to represent it straight.)
Note that this is a special case of multidimensional scaling, where the input
matrix contains all pairwise distances in the graph.

This model certainly does reflect our layout objectives more precisely. Note,
however, that it is NP-hard to determine whether a graph has an embedding
with given edge lengths, even for planar graphs [7]. In contrast to the barycentric
model, the necessary condition of vanishing partial derivatives leads to a system
of non-linear equations, with dependencies between x- and y-coordinates.
Therefore, we can no longer iteratively position vertices optimally with respect
to the temporarily fixed other vertices as in the barycentric model. As a
substitute, a modified Newton-Raphson method can be used to approximate an optimal
move for a single vertex [13]. Since this method does not scale to graphs with
thousands of vertices, we next describe our modifications to make it work on
connection graphs.²



Sparsening. If springs are introduced between every pair of vertices, a single 
iteration takes time quadratic in the number of vertices. Since at least a linear 
number of iterations is needed, this is clearly not feasible. Since, moreover, we are 
not interested in a readable layout of the graph, but in supporting the geometric 
speed-up heuristics for shortest-path computations, there is no need to introduce 
springs between all pairs of vertices. 

We cannot omit springs corresponding to edges, but in connection graphs, 
the number of edges is of the order of the number of vertices, so most of the 
pairs in (2) are connected by a shortest path with at least two edges. If a train 
runs along a path of k edges, we call this path a k-connection. To model the 
plausible assumption that, locally, trains run fairly straight, we include only 
terms corresponding to edges (or 1-connections) and to 2- and 3-connections 
into the potential. Whenever there are two or more springs for a single pair of 
vertices, they are replaced by a single spring of the average length. For realistic 
data, the total number of springs thus introduced is linear in the number of 
vertices. 
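
The sparsening rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `train_paths` and `edge_length` are hypothetical inputs, and the desired length of a k-connection is assumed to be the sum of its k edge lengths (trains running straight), which the text suggests but does not state explicitly.

```python
from collections import defaultdict


def build_springs(train_paths, edge_length, max_k=3):
    """Return {station pair: spring length} for all k-connections, k <= max_k.

    train_paths: list of station sequences, one per train (hypothetical input).
    edge_length: dict frozenset({u, v}) -> desired length of edge {u, v}.
    Assumption: a k-connection gets the summed length of its k edges.
    """
    candidates = defaultdict(dict)  # pair -> {path segment: length}
    for path in train_paths:
        for i in range(len(path)):
            total = 0.0
            for j in range(i + 1, min(i + max_k, len(path) - 1) + 1):
                total += edge_length[frozenset((path[j - 1], path[j]))]
                if path[i] == path[j]:
                    continue  # no spring from a station to itself
                segment = tuple(path[i:j + 1])
                segment = min(segment, segment[::-1])  # ignore travel direction
                candidates[frozenset((path[i], path[j]))][segment] = total
    # Several springs for one pair are merged into one of average length.
    return {pair: sum(c.values()) / len(c) for pair, c in candidates.items()}


# Toy instance: one train A-B-C-D and one direct train A-C.
edges = {("A", "B"): 1.0, ("B", "C"): 2.0, ("C", "D"): 3.0, ("A", "C"): 2.0}
edge_length = {frozenset(e): length for e, length in edges.items()}
springs = build_springs([["A", "B", "C", "D"], ["A", "C"]], edge_length)
```

In the toy instance the pair A, C receives two candidate springs (the direct edge and the 2-connection via B), which are merged into one spring of average length. Each train of length m contributes at most 3m pairs, so the number of springs stays linear in the input size.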

^ In the graph drawing literature, similar objective functions have been subjected to 
simulated annealing [5, 4] and genetic algorithms [14, 2]. These methods seem to scale 

Long-range dependencies. Since we omit most of the long-range dependencies
(i.e. springs connecting distant pairs of vertices), an iterative method starting
from a random layout is almost surely trapped in a very poor local minimum.

We therefore determine an initial layout by computing a local minimum
of the potential on an even sparser graph that includes only the long-range
dependencies relevant for our approach. That is, we consider the subgraph
consisting of all stations that have a fixed position or are a terminal station 
of some train, and introduce springs only between the two terminal stations of 
each train, and between pairs of the selected vertices that are consecutive on the 
path of any train. We refer to these additional pairs as long-range connections. 
In case the resulting graph has more components than the connection graph, we 
heuristically add some stations touched by trains inducing different components 
and the respective springs. After running our layout algorithm on this graph 
(initialized with a barycentric layout), the initial position for all other vertices is 
determined from a barycentric layout in which those positions that have already 
been computed are fixed. 



Iterative improvement. We compute a local minimum of a potential $U(p)$
by relocating one vertex at a time according to the forces acting on it, i.e. the
negative of the gradient, $-\nabla U(p)$.

For each node $v$ (in arbitrary order) we move only this node, depending
on $U(p_v)$. The node is shifted in the opposite direction of

$$d := \nabla U(p_v) := \left( \frac{\partial U(p_v)}{\partial x_v},\ \frac{\partial U(p_v)}{\partial y_v},\ \frac{\partial U(p_v)}{\partial z_v} \right).$$

A substantial parameter of a gradient descent is the size of each step. For 
small graphs it is often sufficient to take a fixed multiple of the gradient (see 
the classic example of [6]), while others suggest some sort of step size reduction 
schedule (e.g., see [8]). We applied a more elaborate method that is robust against 
change of scale, namely the method of Wolfe and Powell (see, e.g., [23, 15]). The 
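step size $\sigma \in (0, \infty)$ is determined by

$$\frac{\nabla U(p_v - \sigma d) \cdot d}{\nabla U(p_v) \cdot d} \le \kappa, \qquad \frac{U(p_v) - U(p_v - \sigma d)}{\sigma \cdot \nabla U(p_v) \cdot d} \ge \delta$$

for given parameters $\delta \in (0, 0.5)$ and $\kappa \in (\delta, 1)$. Roughly speaking, this guarantees
that the potential is reduced and that the step is not too small. In our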
experiments, this method clearly outperformed the simpler methods both in 
terms of convergence and overall running time. 
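
As an illustration of such a step size rule, the following Python sketch searches for a step that satisfies the two standard Wolfe-Powell conditions (sufficient decrease with parameter delta, and a slope reduction with parameter kappa) by doubling and bisection. It is a minimal sketch, not the implementation used in the experiments: the callables `U` and `grad_U`, the default parameter values, and the bracketing strategy are all assumptions made for this example.

```python
def wolfe_powell_step(U, grad_U, p, delta=0.1, kappa=0.9, max_iter=60):
    """Find a step size sigma for the move p -> p - sigma * d, d = grad U(p).

    Accepts sigma once both conditions hold:
      U(p) - U(p - sigma d) >= delta * sigma * (grad U(p) . d)   (enough decrease)
      grad U(p - sigma d) . d <= kappa * (grad U(p) . d)         (step not too small)
    """
    d = grad_U(p)
    slope = sum(di * di for di in d)  # grad U(p) . d
    if slope == 0.0:
        return 0.0  # already at a stationary point
    u0 = U(p)
    lo, hi, sigma = 0.0, float("inf"), 1.0
    for _ in range(max_iter):
        q = [pi - sigma * di for pi, di in zip(p, d)]
        if u0 - U(q) < delta * sigma * slope:
            hi = sigma  # not enough decrease: step too long
        elif sum(gi * di for gi, di in zip(grad_U(q), d)) > kappa * slope:
            lo = sigma  # slope still steep: step too short
        else:
            return sigma
        sigma = 2.0 * sigma if hi == float("inf") else 0.5 * (lo + hi)
    return sigma  # best effort after max_iter trials


# Toy potential (an assumption for this example): squared distance to a target.
target = [1.0, 2.0, 0.0]


def U(p):
    return sum((a - b) ** 2 for a, b in zip(p, target))


def grad_U(p):
    return [2.0 * (a - b) for a, b in zip(p, target)]


p = [4.0, 1.0, 0.5]
sigma = wolfe_powell_step(U, grad_U, p)
p_new = [a - sigma * b for a, b in zip(p, grad_U(p))]
```

Because both conditions are ratios of quantities that scale with the potential, the accepted step does not depend on the units in which edge lengths are measured, which is the robustness against change of scale mentioned above.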



Another dimension. Generally speaking, a set of desired edge lengths can be 
resembled more closely in higher-dimensional Euclidean space. Several models 
make use of this observation by temporarily allowing additional coordinates and
then penalizing their deviation from zero [24] or projecting down [9]. 

We use a third coordinate during all phases of the layout algorithm, but 
ignore it in the final layout. Since projections do not preserve the edge lengths, 
we use a penalty function, where $c_t$ is the penalty weight at the $t$th
iteration, to gradually reduce the value of the z-coordinate towards the end of the
layout computation. Experimentation showed that the final value of the potential 
is reduced by more than 10% with respect to an exclusively two-dimensional 
approach. 

In summary, our layout algorithm consists of the following four steps: 

1. barycentric layout of graph of long-range connections 

2. iterative improvement 

3. barycentric layout of graph of 2-, 3-, and long-range connections 

4. iterative improvement with increasing z-coordinate penalties 

In each of these steps, the iteration is stopped when none of the stations was 
moved by more than a fixed distance. Figure 3 shows the results of this approach 
when applied to the graph of Fig. 2. 




Fig. 3. Layouts of the graph of Fig. 2, where fictitious trains run along grid lines, under 
specific model 



5 Results and Discussion 

Our computational experiments are based on the timetables of the Deutsche
Bahn AG, Germany’s national train and railroad company, for the winter period
1996/97. They contain a total of 933,066 arrivals and departures on 6,961 stations,
for which we have complete coordinate information.

To assess the quality of coordinates generated by the layout algorithms de- 
scribed in Sects. 3 and 4, we used a snapshot from the central travel information 
server of Deutsche Bahn AG. This data consists of 544,181 queries collected over 
several hours of a regular working day. 

This benchmark data set is unique in the sense that it is the only real network
for which we have both coordinates and query data.

In the experiments, shortest paths are computed for these queries using our
own implementation of Dijkstra’s algorithm and the angle-restriction and
goal-directed search heuristics. All implementations are in C or C++, compiled with
gcc version 2.95.2.

From the timetables we generated the following instances: 

— de-org (coordinates known for all stations) 

— de-22-important (coordinates known for the 22 most important^ stations) 

— de-22-random (coordinates known for 22 randomly selected stations) 

— de (no coordinates given) 

For these instances, we generated layouts using the barycentric model of 
Sect. 3 and the specific model of Sect. 4, and measured the average core CPU 
time spent on answering the queries, as well as the number of vertices touched by 
the modified versions of Dijkstra’s algorithm. Each experiment was performed 
on a single 336 MHz UltraSparc-II processor of a Sun Enterprise 4000/5000
workstation with 1024 MB of main memory. The results are given in Tab. 1, and 
the layouts are shown in Figs. 4-5. 



Table 1. Average query response times and number of nodes touched by Dijkstra’s 
algorithm. Without coordinates, the average response time is 104.0 ms (18406 nodes) 
(Fig. 6)
16.8
barycentric


18.1


5017
19.3
(Fig. 7)


specific 


20.7 


5383 


88.0 


12388 


17.9 


3985 



The results show that the barycentric model seems to match very well with 
the angle-restriction heuristic when important stations are fixed. The somewhat 
surprising usefulness of this simple model even for the randomly selected stations 
seems to be due to the fact that our sample spreads out quite well. It will be 
interesting to study this phenomenon in more detail, since the properties that 
make a set of stations most useful are important for practical purposes. 

The specific model appears to work well in all cases. Note that the aver- 
age response time for connection queries is reduced by 60%, even without any 
knowledge of the underlying geography. With the fairly realistic assumption that 
the location of a limited number of important stations is known, the speed-up 
obtained with the actual coordinates is almost matched. 

^ Together with the coordinate information, there is a value associated with each
station that indicates its importance as a hub. The 22 selected stations have the 
highest attained value. 
Fig. 4. de-org

Fig. 5. de



Fig. 6. de-22-important (panels: barycentric, specific)

Fig. 7. de-22-random (panels: barycentric, specific)

To evaluate whether the specific model achieves the objective to preserve
given edge lengths, we generated additional instances from de-org by dropping a
fixed percentage of station coordinates ranging from 0-100%, while setting $\lambda_e$ to
its true value. As can be seen in Fig. 8, these distances are reconstructed quite
well.

As of now, we do not know how our method compares with the most recent
developments in force-directed placement for very large graphs [9, 26], in
particular the method of [11].

Fig. 8. Minimum, mean, and maximum relative error in edge lengths, averaged over
ten instances each (horizontal axis: percentage of stations without coordinates)

Measurement-driven networking research can be viewed as a reaction to the
wide-spread belief within the networking community that much of conventional 
networking research has only given lip-service to the importance of network mea- 
surements and, as a result, has become less and less successful in understanding, 
explaining, and handling the increasingly complex and large-scale nature of mod- 
ern communication networks. In this talk, I will illustrate with three examples 
why and how measurement-based findings about the actual dynamics of Internet 
traffic have opened up new avenues of research into algorithm engineering and 
experiments. 

The first example originates from the discovery of the self-similar scaling
behavior (over large time scales) of aggregate traffic as measured on a link within
the network [?]. It is based on the fact that the observed self-similarity
phenomenon is caused by the high variability (i.e., heavy-tailed distribution with
infinite variance) of the sizes (e.g., number of packets) of the individual user
sessions, TCP connections, or IP flows that make up the aggregate traffic [?]. The
ubiquity of traffic-related high-variability phenomena is in stark contrast with
the traditionally assumed finite variance distributions (e.g., exponential) and
has led to the design, analysis, and implementation of new algorithms for such
engineering problems as CPU load balancing in networks of workstations [5],
connection scheduling in Web servers [?], load-sensitive routing in IP networks
[?], and detecting certain kinds of network attacks [?].

The second example is motivated by the more recent empirical findings that
measured TCP/IP traffic over small time scales exhibits scaling behavior that
is more complex than the self-similar behavior observed over large time scales,
and that these scaling properties over small time scales are intimately related
to the closed loop nature of TCP and very likely also to certain features of
the topology of the Internet [?]. I will argue that this finding plays havoc with
the traditional open loop modeling approaches that have provided the basis for
countless algorithms for evaluating, improving, or optimizing various aspects of
network performance. In particular, novel approaches are required for adequately
accounting for feedback when evaluating the performance of closed loop systems
such as a TCP/IP-dominated Internet [?].

Finally, for the third example, we turn to the very problem of uncovering
dominant characteristics of some aspects of Internet topology. In this context,
measurement-driven research performed over the past two years has led to the
discovery of wide-spread high-variability phenomena associated with the
connectivity graph structure of Autonomous Systems (ASs) in today’s Internet [?].
The resulting power-law graphs (i.e., the node degree follows a heavy-tailed
distribution with infinite variance) are drastically different from the traditionally
considered Renyi-Erdos random graphs (where the node degree follows
approximately a Poisson distribution) and promise to be a gold mine for research
activities related to the development and study of algorithms that fully exploit
the special connectivity structure underlying power-law graphs but are likely to
perform poorly when applied to the familiar and well-studied family of random
graphs [?].

Abstract. Range searches in metric spaces can be very difficult if the
space is “high dimensional”, i.e. when the histogram of distances has
a large mean and/or a small variance. This so-called “curse of dimensionality”,
well known in vector spaces, is also observed in metric spaces.
There are at least two reasons behind the curse of dimensionality: a large 
search radius and/or a high intrinsic dimension of the metric space. We 
present a general probabilistic framework based on stretching the trian- 
gle inequality, whose direct effect is a reduction of the effective search 
radius. The technique gets more effective as the dimension grows, and 
the basic principle can be applied to any search algorithm. In this paper 
we apply it to a particular class of indexing algorithms. We present an 
analysis which helps understand the process, as well as empirical evi- 
dence showing dramatic improvements in the search time at the cost of 
a very small error probability. 



1 Introduction 

The concept of “proximity” searching has applications in a vast number of fields. 
Some examples are multimedia databases, data mining, machine learning and 
classification, image quantization and compression, text retrieval, computational 
biology and function prediction, just to name a few. 

All those applications have some common characteristics. There is a universe
$X$ of objects, and a nonnegative distance function $d : X \times X \to \mathbb{R}^+$ defined
among them. This distance satisfies the three axioms that make the set a metric
space: strict positiveness ($d(x,y) = 0 \Leftrightarrow x = y$), symmetry ($d(x,y) = d(y,x)$)
and triangle inequality ($d(x,z) \le d(x,y) + d(y,z)$). This distance is considered
expensive to compute (think, for instance, of comparing two fingerprints). We
have a finite database $U \subseteq X$, of size $n$, which is a subset of the universe of
objects. The goal is to preprocess the database $U$ to quickly answer (i.e. with
as few distance computations as possible) range queries and nearest neighbor
queries. We are interested in this work in range queries, expressed as $(q, r)$ (a
point in $X$ and a tolerance radius), which should retrieve all the points at distance
$r$ or less from $q$, i.e. $\{u \in U \mid d(u,q) \le r\}$. Other interesting queries are
nearest-neighbor ones, which retrieve the $K$ elements of $U$ that are closest to $q$.

* This work has been partially supported by CYTED VII.13 AMYRI Project (both
authors), CONACyT grant R-28923A (first author) and FONDECYT Project
1-000929 (second author).

1.1 The Curse of Dimensionality 

A particular case of the problem arises when the metric space is $\mathbb{R}^k$ and the
distance is Minkowski’s $L_s$. In this case the objects have a geometric meaning
and the coordinate information can be used to guide the search. 

There are effective methods for vector spaces, such as kd-trees [?], R-trees [?]
or X-trees [?]. However, for random vectors on more than roughly 20 dimensions
all those structures cease to work well. There exist proven lower bounds [?] which
show that the search complexity is exponential with the dimension.

It is interesting to point out that the concept of “dimensionality” can be 
translated into metric spaces as well. The typical feature of high dimensional 
spaces in vector spaces is that the probability distribution of distances among 
elements has a very concentrated histogram (with a larger mean as the dimension 
grows). In [5, 7] this is used as a definition of intrinsic dimensionality for general
metric spaces, which we adopt in this paper: 

Definition 1. The intrinsic dimension of a database in a metric space is $\rho = \frac{\mu^2}{2\sigma^2}$,
where $\mu$ and $\sigma^2$ are the mean and variance of its histogram of distances.

Under this definition, a database formed by random $k$-dimensional vectors
where the coordinates are independent and identically distributed has intrinsic
dimension $\Theta(k)$ [?]. Hence, the definition naturally extends that of vector
spaces.
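
Definition 1 can be estimated directly from a sample of pairwise distances. The sketch below is an illustration only: the function name `intrinsic_dimension`, the sample size, and the uniformly random test data are assumptions of this example, not part of the paper.

```python
import math
import random


def intrinsic_dimension(db, dist, samples=5000, seed=0):
    """Estimate rho = mean^2 / (2 * variance) of the histogram of distances."""
    rng = random.Random(seed)
    ds = [dist(*rng.sample(db, 2)) for _ in range(samples)]
    mean = sum(ds) / len(ds)
    var = sum((x - mean) ** 2 for x in ds) / len(ds)
    return mean * mean / (2.0 * var)


# Uniform random vectors in 2 and in 16 dimensions under the Euclidean distance.
rng = random.Random(1)
rho = {}
for dim in (2, 16):
    db = [[rng.random() for _ in range(dim)] for _ in range(500)]
    rho[dim] = intrinsic_dimension(db, math.dist)
```

On such data the estimate grows with the vector dimension, in line with the statement that the definition extends the usual notion of dimension for vector spaces.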

Analytical lower bounds and experiments in [?] show that all the algorithms
degrade systematically as the intrinsic dimension $\rho$ of the space increases. The
problem is so hard that it has received the name of “curse of dimensionality”,
and it is due to two possible reasons. On one hand, if $\rho$ increases because the
variance is reduced, then we have that a concentrated histogram of distances
gives less information.^ On the other hand, if $\rho$ increases because the mean of
the distribution grows, then in order to retrieve a fixed fraction of the database 
(and also to get a constant number of nearest neighbors) we need to use a larger 
search radius. In most cases both facts hold simultaneously. 

An interesting question is whether a probabilistic or approximate algorithm 
could break the curse of dimensionality or at least alleviate it. Approximate 
and probabilistic algorithms are acceptable in most applications that search in 
metric spaces, because in general the modelization as a metric space already 
carries some kind of relaxation. In most cases, finding some close elements is as 
good as finding all of them. 

This is the focus of this paper. In the next section we review related work 
and put our contribution in context. 

^ In the extreme case we have a space where $d(x,x) = 0$ and $\forall y \neq x$, $d(x,y) = 1$,
where it is impossible to avoid any distance evaluation at search time.

2 Related Work and Our Contribution

Most of the existing approaches to solve the search problem on metric spaces are
exact algorithms which retrieve exactly the elements of $U$ at distance $r$ or less
from $q$. In [?] most of those approaches are surveyed and explained in detail.

In this work we are more interested in approximate and probabilistic algorithms,
which relax the condition of delivering the exact solution. This relaxation
uses, in addition to the query, a precision parameter $\varepsilon$ to control how far away
(in some sense) we want the outcome of the query from the correct result.

Approximation algorithms are surveyed in depth in [?]. An example is
[?], which proposes a data structure for real-valued vector spaces under any
Minkowski metric $L_s$. The structure, called the BBD-tree, is inspired by kd-trees
and can be used to find “$(1 + \varepsilon)$ nearest neighbors”: instead of finding
$u$ such that $d(u,q) \le d(v,q)\ \forall v \in U$, they find $u^*$ such that
$d(u^*,q) \le (1 + \varepsilon)\, d(v,q)\ \forall v \in U$.

The essential idea behind this algorithm is to locate the query $q$ in a cell
(each leaf in the tree is associated with a cell in the decomposition). Every point
inside the cell is processed so as to obtain its nearest neighbor u. The search stops
when no promising cells are found, i.e. when the radius of any ball centered at
$q$ and intersecting a nonempty cell exceeds the radius $d(q,p)/(1 + \varepsilon)$. The query
time is $O(\lceil 1 + 6k/\varepsilon \rceil^k\, k \log n)$, where $k$ is the dimensionality of the vector space.

Probabilistic algorithms have been proposed for nearest neighbor searching
only, for vector spaces in [?], and for general metric spaces in [2]. We
explain a couple of representative proposals.

In [?], the data structure is a standard kd-tree. The author uses “aggressive
pruning” to improve the performance. The idea is to increase the number of 
branches pruned at the expense of losing some candidate points in the process. 
This is done in a controlled way, so the probability of success is always known. 
The data structure is useful for finding limited radius nearest neighbors, i.e. 
nearest neighbors within a fixed distance to the query. 

In [?], the author chooses a “training set” of queries and builds a data structure
able to answer correctly only queries belonging to the training set. The
idea is that this setup is enough to answer correctly, with high probability, an
arbitrary query using the training set information. Under some probabilistic
assumptions on the distribution of the queries, it is shown that the probability of
not finding the nearest neighbor is $O((\log n)^2 / K)$, where $K$ can be made arbitrarily
large at the expense of $O(K n \alpha)$ space and $O(K \alpha \log n)$ expected search
time. Here $\alpha$ is the logarithm of the ratio between the farthest and the nearest
pairs of points in the union of the database $U$ and the training set.

In this paper we present a probabilistic technique for range searching on 
general metric spaces. We exploit the intrinsic dimension of the metric space, 
specifically the fact that on high dimensions the difference between random 
distances is small. Every algorithm to search in metric spaces can make use 
of this property in one form or another so as to become a much more efficient 
probabilistic algorithm. In particular, we choose the most basic algorithm (which 
permits computing easily the probability of making a mistake) and apply the 
technique to it. We show analytically that the net effect is a reduced search cost,
corresponding to a smaller search radius. Reducing the search radius has an effect 
which is very similar to that of reducing the dimensionality of the metric space. 
We also present empirical results showing a dramatic increase in the efficiency 
of the algorithm at a very moderate error probability. 



3 Stretching the Triangle Inequality 

A large class of algorithms to search on metric spaces are so-called “pivot based”
[?]. Pivot based algorithms (e.g. [?]) are built on a single general idea. We
select $k$ random elements $\{p_1, \ldots, p_k\} \subset U$, called pivots. The set is preprocessed
so as to build a table of $nk$ entries, where all the distances $d(u,p_i)$ are stored for
every $u \in U$ and every pivot $p_i$. When a query $(q, r)$ is submitted, we compute
$d(q,p_i)$ for every pivot $p_i$ and then try to discard elements $u \in U$ by using the
triangle inequality on the information we have. Two facts can be used:

$$d(u,p_i) \le d(u,q) + d(q,p_i) \quad \text{and} \quad d(q,p_i) \le d(q,u) + d(u,p_i) \qquad (1)$$

which can be reexpressed as $d(q,u) \ge |d(u,p_i) - d(q,p_i)|$, and therefore we
can discard all those $u$ such that $|d(u,p_i) - d(q,p_i)| > r$. Note that by storing
the table of $kn$ distances and by computing the $k$ distances from $q$ to the pivots
we have enough information to carry out this elimination. On the other hand,
the elements $u$ which cannot be eliminated with this rule have to be directly
compared against $q$.
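
The generic pivot based scheme just described can be sketched as follows. This is a minimal illustration, not the code used in the experiments: the function names, the two-dimensional random data, and the choice of four pivots are assumptions made for this example.

```python
import math
import random


def build_table(db, pivots, dist):
    """Precompute the n x k table of distances d(u, p_i)."""
    return [[dist(u, p) for p in pivots] for u in db]


def range_query(db, pivots, table, dist, q, r):
    """Return all u with d(q, u) <= r and the number of external evaluations."""
    dq = [dist(q, p) for p in pivots]  # k internal evaluations
    result, external = [], 0
    for u, row in zip(db, table):
        # Triangle inequality: |d(u, p_i) - d(q, p_i)| > r implies d(q, u) > r.
        if any(abs(du - dqp) > r for du, dqp in zip(row, dq)):
            continue  # discarded without computing d(q, u)
        external += 1
        if dist(q, u) <= r:
            result.append(u)
    return result, external


rng = random.Random(0)
db = [(rng.random(), rng.random()) for _ in range(1000)]
pivots = rng.sample(db, 4)
table = build_table(db, pivots, math.dist)
q, r = (0.5, 0.5), 0.1
result, external = range_query(db, pivots, table, math.dist, q, r)
brute_force = [u for u in db if math.dist(q, u) <= r]
```

The answer coincides with a brute-force scan, while `external` counts only the elements that survive the pivot filter.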

The $k$ distance computations $d(q,p_i)$ are called internal evaluations, while
those computations $d(q,u)$ against elements $u$ that cannot be ruled out with the
pivot information are called external evaluations. It is clear that, as $k$ grows,
the internal evaluations grow and the external ones decrease (or at least do not
increase). It follows that there is an optimum $k$. However, it is well known [7]
that even for moderately high dimensions the optimal k is so large that the 
index requires impractical amounts of memory. Therefore, in practice one uses 
the largest k that the available space permits. 

Figure 1 illustrates the curse of dimensionality for pivot based algorithms. We
measure $d(q,p_i)$ and discard all the elements falling outside the range $d(q,p_i) \pm r$.
This area contains very few elements when the histogram is very concentrated.
Moreover, it is possible that we need a larger search radius $r$ on higher dimensions.

An interesting fact is that the difference between two random distances
$d(u,p_i)$ and $d(q,p_i)$ can be seen as a random variable which distributes more
or less like the histogram (indeed, with mean zero and a variance which is twice
that of the histogram). This means that, as the intrinsic dimension grows, the
difference $|d(u,p_i) - d(q,p_i)|$ is normally very small.

Fig. 1. The histograms of distances for a low and a high dimensional metric space. The
grayed parts are those that a pivot based algorithm cannot discard.

Figure 2 depicts this situation, where the density functions of two random
variables $X$ and $Z$ are shown. The random variable $X$ represents a random
distance, while $Z$ represents the difference between two random distances. If we
want to retrieve a fraction $f$ of the database we need to use a search radius large
enough to make the area under $f_X(0 \ldots r)$ equal to $f$.

Now, if $Z$ represents the difference between $d(u,p_i)$ and $d(q,p_i)$ for some
element $u$ of the database, then the probability of discarding $u$ is proportional
to the area under $f_Z(r \ldots \infty)$. That is, in order to retrieve the double grayed area
in Figure 2 we have to traverse the single grayed area fraction of the database.

In high dimensions the density $f_X$ shifts to the right while the differences $f_Z$
concentrate around the origin. Hence in order to retrieve the same fraction of
the database it is necessary to traverse a larger fraction of it as the dimension
increases.




Fig. 2. The search complexity of pivoting algorithms. 



If the two histograms do not intersect at all, then retrieving any fraction
of the database implies traversing all the database. In high dimensions this is
precisely the situation. To avoid the curse of dimensionality it would be necessary
to shift $f_Z$ to the right. This can be done up to some extent by increasing the
number of pivots (since the real distribution will be the maximum of $k$
variables distributed like $Z$). There is however a limit on $k$ given by the amount
of memory available.

We propose another method to shift the density $f_Z$ to the right. It consists
simply of multiplying it by a constant $\beta > 1$. We “stretch” the triangle inequality
by multiplying the differences between the distances by $\beta$ before using them. In
practice, we discard all the elements which satisfy

$$\beta\, |d(u,p_i) - d(q,p_i)| > r, \quad \text{i.e.,} \quad |d(u,p_i) - d(q,p_i)| > r/\beta$$

which is equivalent to decreasing the search radius while maintaining the discarding
radius. That is, we use $r/\beta$ to determine which are the candidate elements,
but return to $r$ when it comes to check those candidates directly.
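
A minimal Python sketch of the stretched filter follows. The function name, the eight-dimensional random data, and the values of the radius and of the stretching factor are assumptions chosen for illustration; with a factor of 1 the routine reduces to the exact pivot algorithm.

```python
import math
import random


def stretched_range_query(db, pivots, table, dist, q, r, beta):
    """Pivot filter with the differences multiplied by beta before the test."""
    dq = [dist(q, p) for p in pivots]
    result, external = [], 0
    for u, row in zip(db, table):
        if any(beta * abs(du - dqp) > r for du, dqp in zip(row, dq)):
            continue  # discarded, possibly wrongly when beta > 1
        external += 1
        if dist(q, u) <= r:  # candidates are still checked against the original r
            result.append(u)
    return result, external


rng = random.Random(0)
db = [tuple(rng.random() for _ in range(8)) for _ in range(2000)]
pivots = rng.sample(db, 8)
table = [[math.dist(u, p) for p in pivots] for u in db]
q = tuple(rng.random() for _ in range(8))
r = 0.6
exact, cost_exact = stretched_range_query(db, pivots, table, math.dist, q, r, beta=1.0)
approx, cost_approx = stretched_range_query(db, pivots, table, math.dist, q, r, beta=2.0)
```

Whatever the data, `approx` is a subset of `exact` and `cost_approx` never exceeds `cost_exact`: the stretched filter can only drop additional elements, never admit wrong ones.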

Figure 3 illustrates how the idea operates. The exact algorithm selects a ring
which guarantees that no relevant element is left out, while the probabilistic one
stretches both sides of the ring and can miss some elements.

Fig. 3. The rings of the elements that are not discarded under the normal and the
probabilistic algorithm. Normal: ring between $d(p,q) - r$ and $d(p,q) + r$;
probabilistic: ring between $d(p,q) - r/\beta$ and $d(p,q) + r/\beta$.



4 Probability of Error 

The algorithm we have presented is probabilistic with one-sided error, in the
sense that it can fail to report some elements which are indeed at distance $r$ or
less to $q$, but it cannot report elements that are outside the query area. In this
section we derive the maximum $\beta$ that can be used in order to have a probability
not larger than $\varepsilon$ of missing a relevant answer.

We are firstly interested in computing an upper bound to the probability of
incorrectly discarding an element $u$ which should be returned by the query $(q, r)$.
Since $d(q,u) \le r$, then by the triangle inequality we have $|d(q,p) - d(u,p)| \le r$.
Let us define two random variables $X = d(q,p)$ and $Y = d(u,p)$. It is easy to
see that both variables are distributed according to the histogram of the metric
space, since they correspond to distances among random elements. However,
there is a positive covariance between $X$ and $Y$ because they are two distances
to a single element $p$.

Our question is therefore which is the probability of $|X - Y| > r/\beta$ when
$d(q,u) \le r$. This last condition has some consequences, such as $|X - Y| \le r$. By
ignoring this fact, we obtain a higher probability, i.e.

$$\Pr(\text{error}) = \Pr(|X - Y| > r/\beta \mid d(q,u) \le r) \le \Pr(|X - Y| > r/\beta)$$



Let us consider the random variable $Z = X - Y$, which has mean $\mu_Z = 0$ and
whose variance $\sigma_Z^2$ would be $2\sigma^2$ if $X$ and $Y$ were independent, but it is lower
because $X$ and $Y$ are positively correlated.

Using Chebyshev's inequality (for a random variable $Z$, $\Pr(|Z - \mu_Z| > x) < \sigma_Z^2 / x^2$),
we can upper bound the above probability

$$\Pr(\text{error}) \le \Pr(|X - Y| > r/\beta) < \frac{2\sigma^2}{(r/\beta)^2} \qquad (2)$$

We want the probability of missing a relevant element after using the $k$
pivots to be at most $\varepsilon$. Since an element $u$ can be discarded by any pivot, the
probability of incorrectly discarding $u$ has to consider the union of all the probabilities.
What we want is

$$1 - (1 - \Pr(\text{error}))^k \le \varepsilon$$

which, by using our upper bound of Eq. (2), yields an upper bound on $\beta$:

$$\beta \le \frac{r \sqrt{\varepsilon_k}}{\sqrt{2}\,\sigma} \qquad (3)$$

where we have defined $\varepsilon_k = 1 - (1 - \varepsilon)^{1/k}$, which goes from 0 to 1 together with
$\varepsilon$, but at a faster pace which depends on $k$. The bound improves as the search
radius $r$ or the error probability $\varepsilon$ grow, and worsens as the number of pivots $k$
or the variance $\sigma^2$ grow. Worsening as $k$ grows is an unfortunate but unavoidable
property of the method. It means that as we increase $k$ to improve the efficiency,
the probabilistic method gets less efficient. We return to this issue later.
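
The bound can be evaluated numerically. The sketch below assumes the reading of Eq. (3) as $\beta \le r\sqrt{\varepsilon_k}/(\sqrt{2}\,\sigma)$ with $\varepsilon_k = 1 - (1-\varepsilon)^{1/k}$; the function names and the sample values of $r$, $\sigma$ and $\varepsilon$ are assumptions of this example.

```python
import math


def epsilon_k(eps, k):
    """Per-pivot error budget: 1 - (1 - eps)^(1/k)."""
    return 1.0 - (1.0 - eps) ** (1.0 / k)


def max_beta(r, sigma, eps, k):
    """Largest stretching factor allowed by the Chebyshev-based bound."""
    return r * math.sqrt(epsilon_k(eps, k)) / (math.sqrt(2.0) * sigma)


r, sigma, eps = 1.0, 0.1, 0.05
betas = {k: max_beta(r, sigma, eps, k) for k in (1, 4, 16, 64)}

# Plugging the maximum beta back into Eq. (2) uses up exactly the error budget.
k = 4
p_error = 2.0 * sigma ** 2 / (r / betas[k]) ** 2
total_error = 1.0 - (1.0 - p_error) ** k
```

The values in `betas` decrease as the number of pivots grows, which is the worsening described above. When the computed value falls below 1, this loose bound permits no stretching at all for the given parameters.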

We show now that the bound improves as the intrinsic dimension $\rho$ of the
space grows, which is a very desirable property. We use Markov’s inequality (for
a random variable $W$, $\Pr(W > x) < \mu_W / x$). Assume that we want to retrieve a
fraction $f$ of the database and that $W = d(q,u)$ (i.e. $W$ is distributed according
to the histogram) for a random $u$. Then we need to use a search radius $x$ such
that $f = \Pr(W \le x) > 1 - \mu/x$, or which is the same, $x < \mu/(1 - f)$. Hence we
pessimistically assume that the search radius is $r = \mu/(1 - f)$. The upper bound
of Eq. (3) becomes

$$\beta \le \frac{\mu \sqrt{\varepsilon_k}}{(1 - f)\sqrt{2}\,\sigma} = \frac{\sqrt{\rho\, \varepsilon_k}}{1 - f} \qquad (4)$$

Hence, for a fixed error probability $\varepsilon$, the bound gets more permissive as the
problem becomes harder (larger $\rho$ or $f$).

Figure 4 shows the probability of retrieving a relevant element as $1/\beta$ grows.
We show the effect with 1 and 64 pivots. The metric space is formed by 10,000
random vectors in the unit cube with the Euclidean ($L_2$) distance. Although
this is in particular a vector space, we treat it as a general metric space.
The advantage of using a vector space is that we can precisely control its dimensionality.

Fig. 4. The probability of retrieving relevant elements as a function of $1/\beta$. Panels:
1 pivot, retrieving 1.0% of the database; 64 pivots, retrieving 1.0% of the database.

The plots show clearly that the probability of missing an element for a fixed
$\beta$ increases as the number of pivots grows. This is a negative effect of increasing
$k$ which is not present on the exact algorithm, and which can make it preferable
to use a smaller $k$. A second observation is that, for a fixed $\beta$, the probability
of error is reduced as the dimension grows. Hence we are permitted to relax
the algorithm more on higher dimensions. In the experiments that follow, we
do not use $\beta$ directly anymore, but rather plot the results as a function of the
probability of retrieving a relevant element.

5 Efficiency 

Let us now consider the expected complexity for $\beta > 1$. We will establish in fact a
lower bound, i.e. we present an optimistic analysis. This is because Chebyshev’s
inequality does not permit us bounding in the other direction. Still the analysis
has interest since it can be compared against the lower bound obtained in [?]
in order to check the improvement obtained in the lower bound.

We discard an arbitrary point $u$ with a pivot $p_i$ whenever $|X - Y| > r/\beta$,
where $X$ and $Y$ are as in the previous section. Hence $Z = X - Y$ has mean zero
and variance at most $2\sigma^2$ and we can directly use Chebyshev to upper bound
the probability of discarding $u$ with a random pivot $p_i$:

$$\Pr(|X - Y| > r/\beta) < \frac{2\sigma^2}{(r/\beta)^2}$$


and the probability of discarding an element u with any of k random pivots is 
1 — (1 — Pr{\X — Y\ > r//3)^). The average search cost is the internal evaluations 
k plus the external ones, which are on average n times the probability of not 
discarding a random element u. This gives a lower bound 

/ 2(t2 

Cost > k + n 1 — , 

V {r/l3rJ 

where, if we compare against the previously obtained lower bounds, we can see that
the net effect is that of reducing the search radius. That is, our search cost is
the same as if we searched with radius r/β (this reduction is used just to discard
elements; of course, when we compare q against the remaining candidates we use
the original r). The search cost grows very fast as the search radius increases,
so we expect a large improvement in the search time using a very moderate β,
and hence with a low error probability. To see the effect in terms of the intrinsic 
dimension, we convert again r = /i/(l — /) to get 



which shows that the scheme still worsens as ρ grows, despite being more
permissive.
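
As a rough illustration of the lower bound on the search cost given above, the following Python sketch evaluates k + n(1 − 2σ²/(r/β)²)^k for given parameters. The function name and the clamping of the Chebyschev term to 1 (beyond which the bound on the discarding probability is vacuous) are our own assumptions and are not part of the analysis.

```python
def cost_lower_bound(k, n, sigma2, r, beta):
    """Evaluate k + n * (1 - 2*sigma2 / (r/beta)**2) ** k.

    k      -- number of pivots (internal evaluations)
    n      -- database size
    sigma2 -- variance of the distance distribution
    r      -- search radius
    beta   -- stretching factor (beta >= 1; beta = 1 is the exact algorithm)
    """
    # Chebyschev-style bound on the probability that one pivot discards u.
    # Assumption: clamp to 1, since a probability bound above 1 is vacuous.
    p_discard = min(1.0, 2.0 * sigma2 / (r / beta) ** 2)
    return k + n * (1.0 - p_discard) ** k
```

For fixed k, n, σ² and r, increasing β can only decrease the value returned, which matches the observation that the net effect is that of searching with the reduced radius r/β.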

Now we relate β with its maximum allowed value obtained in the previous
section (Eq. 0). The result is Cost ≥ k + (1 − ε)n, which says that, to obtain
(1 − ε) of the result, we basically have to traverse (1 − ε) of the data set.

Of course this last result is not good. This is not a consequence of the ap- 
plication of Chebyschev (which is a very loose bound), but of the upper bound 
model we have used in Section 0 In we show that even a tighter error model 
leads to a poor result: we pay (1 — e) times the search cost of the exact algorithm. 

As we show next, in practice the results are much better. The real reason 
for the pessimistic analytical results is our poor understanding of the real be- 
havior of Z = X — Y under the condition d{q, u) < r. We have treated Z as if 
X and Y were independent (in p] we attempted a tighter model with similar 
results). In practice, the histogram of Z is much more concentrated and hence 
the error probability is much lower. Despite this looseness, the analysis permits 
understanding the role played by the different variables. 

Figure 5 shows the number of comparisons as a function of the fraction of
relevant elements retrieved, for different combinations of numbers of pivots and 
search radii. We are using the same database as for the previous experiments. 

As can be seen, we can retrieve even 90% or 95% of the relevant elements
paying much less than the time necessary for the exact algorithm (which
corresponds to 1/β = 1 in the plots). In many cases there is a large difference between
the costs to retrieve 99% and 100% of the set. These differences are more
noticeable when the number of pivots is not enough to get good results with the exact
algorithm. In practice, we can obtain the same result as the exact algorithm with
far fewer pivots. For example, 16 dimensions is almost intractable for the exact
algorithm with fewer than 256 pivots, while with the probabilistic algorithm we
can get acceptable results with 16 pivots.

Fig. 5. Number of distance evaluations as a function of the fraction of elements
retrieved, for different dimensions.

Figure 6 shows the effect of different search radii (retrieving from 0.01% to
5% of the database). Note that increasing the search radius has an effect similar
to that of increasing the dimension*.

* We chose to measure the radius in terms of percentage of the set retrieved because
this gives a measure that can be translated into other metric spaces and permits
understanding how much of the database needs to be traversed in order to retrieve
a given fraction of the total. Showing the actual radii used would just give useless
geometrical information.

Fig. 6. Number of distance evaluations as a function of the fraction of elements
retrieved, for different radii. (x axis: fraction of the result actually retrieved.)

Fig. 7. On the left, number of external distance evaluations as a function of the number
of pivots used, when retrieving a fixed fraction 0.97 of the relevant results with a search
radius that should retrieve 0.5% of the dataset. On the right, comparative performance
of our approach versus the exact algorithm for the edit distance on text lines.

Finally, Figure 7 (left) shows that, as we mentioned, the external complexity
is not monotonically decreasing with the number of pivots k, as it is for the
exact algorithm. This is because, as k increases, the probability of missing a
relevant answer grows, and therefore we need a larger β to keep the same error
probability. This, in turn, increases the search time.
This effect worsens in higher dimensions: if we use enough pivots to fight
the high dimension, then the error probability goes up. Therefore, as shown
analytically, the scheme also gets worse as the dimension grows. However, it
worsens much more slowly and obtains results similar to those of the exact algorithm
using far fewer pivots.



6 A Real-Life Example 

In this experiment we used a database of lines of text from the Wall Street
Journal 1987 collection from TREC. We cut the text into lines of at least
16 characters without splitting words. We used the edit distance to compare
database elements (the minimum number of character insertions, deletions and
substitutions needed to make the two strings equal). This model is commonly used in
text retrieval, signal processing and computational biology applications.

Since the edit distance is discrete we cannot use an arbitrary β, hence for
r = 10 we tried search radii r/β in the set [0..10]. On the x axis we put the
probability of success (i.e. the fraction of the result actually retrieved) and on the
y axis the fraction of distance computations with respect to the exact algorithm.
Figure 7 (right) shows the results for different numbers of pivots.

As can be seen, with a moderately high probability (more than 0.8) we can 
improve the exact algorithm by a factor of 3. The exact algorithm has to traverse 
around 90% of the database, while our probabilistic approach examines around 
26% of the database. It also becomes clear that there is an optimum number of 
pivots, which depends on the desired probability of success. 



7 Conclusions 

We have presented a probabilistic algorithm to search in metric spaces. It is 
based on taking advantage of high dimensionalities by “stretching” the triangle 
inequality. The idea is general, but we have applied it to pivot based algorithms 
in this work. We have analytically found the improvement that can be expected 
as a function of the error probability permitted, the fraction of the set that is 
to be retrieved and the number of pivots used. The analysis shows that the net 
effect of the technique is to reduce the search radius, and that the reduction is 
larger when the search problem becomes harder (i.e. ρ or f grow). Finally, we
have experimentally shown that even with very little stretching (and hence with 
very low error probability) we obtain dramatic improvements in the search time. 

It is worth noting that β can be chosen at search time, so the indexing can
be done beforehand and later we can choose the desired combination of speed
and accuracy. Moreover, β represents a deterministic guarantee by itself: no
element closer than r/β can be missed.
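
The following Python sketch illustrates how the stretching factor can be supplied at query time to a pivot table built beforehand. It is a minimal illustration under our own assumptions, not the implementation used in the experiments: the names build_index and range_search are hypothetical, and Euclidean vectors are used only to have a concrete metric.

```python
import math
import random

def build_index(points, pivots, dist):
    """Precompute the distance from every database element to every pivot."""
    return [[dist(u, p) for p in pivots] for u in points]

def range_search(points, pivots, table, dist, q, r, beta=1.0):
    """Return elements within distance r of q; beta > 1 may miss some."""
    dq = [dist(q, p) for p in pivots]        # internal evaluations
    limit = r / beta                         # stretched discarding radius
    result = []
    for u, du in zip(points, table):
        # Discard u if some pivot separates it from q by more than r/beta.
        if any(abs(a - b) > limit for a, b in zip(dq, du)):
            continue
        if dist(q, u) <= r:                  # external evaluation, original r
            result.append(u)
    return result
```

With beta = 1 the filter is the usual triangle-inequality test and the answer is exact. With beta > 1 an element u can be discarded only if |d(q, p) − d(u, p)| > r/β for some pivot p, which by the triangle inequality requires d(q, u) > r/β; this is the deterministic guarantee stated above.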

A number of issues are left open for future work. A first one is to obtain an 
analysis which reflects better the experimental results. This includes character- 
izing the point where adding more pivots does not help. 
We also plan to apply the idea to other data structures, which opens a number
of possibilities. In general, all the algorithms can be modeled as performing some
internal evaluations to discard elements with some rule related to the triangle
inequality and later comparing the elements not discarded (external evaluations).
If the internal evaluations do not depend on each other, then using this
technique the internal complexity remains the same and the external corresponds
to searching with radius r/β.

Yet another idea is as follows: currently our algorithm has one-sided error,
since it can miss relevant elements but cannot report non-relevant ones. It is
not hard to make it two-sided by removing the verification step, i.e. we
simply report all the candidate elements that survive the triangle inequality test
(with the smaller radius r/β). Hence the cost with k pivots is just k, but as k
increases the probability of (over-reporting) error is reduced. So we now have
two tools to handle the error probability: the number of pivots and the search
radius. It is not clear how these elements interact in this
new scenario. We believe that the idea can yield good results for applications
where a two-sided error is acceptable.
Abstract. We develop computationally feasible algorithms to numerically
investigate the asymptotic behavior of the length H_d(n) of a maximal
chain (longest totally ordered subset) of a set of n points drawn
from a uniform distribution on the d-dimensional unit cube 𝒱_d = [0, 1]^d.
For d ≥ 2, it is known that c_d(n) = H_d(n)/n^(1/d) converges in probability
to a constant c_d < e, with lim_{d→∞} c_d = e. For d = 2, the problem has
been extensively studied, and it is known that c_2 = 2. Monte Carlo
simulations coupled with the standard dynamic programming algorithm for
obtaining the length of a maximal chain do not yield computationally
feasible experiments. We show that H_d(n) can be estimated by considering
only the chains that are close to the diagonal of the cube and develop
efficient algorithms for obtaining the maximal chain in this region of the
cube. We use the improved algorithm together with a linearity conjecture
regarding the asymptotic behavior of c_d(n) to obtain even faster
convergence to c_d. We present experimental simulations to demonstrate
our results and produce new estimates of c_d for d ∈ {3, . . . , 6}.



1 Introduction 

Let 𝒱_d = [0, 1]^d be the d-dimensional unit cube and let n random points
x(1), x(2), . . . , x(n) be chosen independently from the uniform distribution on
𝒱_d. These points form the underlying set of a random order P_d(n) with a partial
ordering given by x(i) < x(j) if and only if x_k(i) < x_k(j) for all k = 1, . . . , d.
The number of elements in a longest chain (totally ordered subset) of P_d(n) is
called the height of P_d(n). We are interested in the asymptotic behavior of the
random variable c_d(n) = H_d(n)/n^(1/d). The problem was intensively studied,
experimentally and mathematically, for the case of d = 2 (see ([B|, P, P|, [1 Hj -
m, cs], 0, 0, P2I). It was shown that E[L_n]/√n → 2. The multidimensional
case was initially considered by Steele. Bollobás and Winkler proved
that c_d(n) converges in probability to a constant c_d (d > 0). Except for c_2 = 2,
no other c_d is currently known. Further, it is not even known whether or not the
sequence {c_d} is monotonically increasing in d. Using Monte Carlo simulation
seems to be a natural approach to estimating c_d for d > 2. Unfortunately, it
was discovered through experiments m), that even for d = 3, the estimates 
for Cd{n) do not converge for n up to 10®. The standard dynamic programming 
Table 1. Estimates of c_d. The intervals indicate the estimated range of c_d.

d      d = 2            d = 3            d = 4            d = 5            d = 6
c_d    [1.998, 2.002]   [2.363, 2.366]   [2.514, 2.521]   [2.583, 2.589]   [2.607, 2.617]



algorithm for computing the length of a maximal chain is quadratic in n with 
memory requirement linear in n. Performing a sufficient number of Monte Carlo 
simulations for n Ri 10^'^ is not feasible, even with advanced computational re- 
sources. The time and memory requirements are too great. 

In this paper, we present an approach that addresses both the computational 
efficiency and the memory issues. Our estimates suggest that {cd} is a monoton- 
ically increasing sequence in d, a fact that has previously not been suggested by 
any experiments. 

Our technique combines two ideas. First, generalizing the result from g|, we
prove that a maximal chain “r-close” to the diagonal of the unit cube 𝒱_d = [0, 1]^d
must exist as n → ∞. Second, we design an implementation of the dynamic
programming algorithm for computing a longest r-close chain which runs in O(rn²)
time, using O(rn) memory. The parameter 0 < r ≪ 1 can be decreased to
boost the efficiency of the algorithm for very large values of n. The algorithm’s
efficiency results from the fusion of the input generation and the computation
of the longest r-close chain into one procedure where only O(nr) points are “at
work” at every moment of the computation. The reduction of the running time
and the memory allows us to perform a sufficient number of Monte Carlo
simulations. Based upon geometrical considerations, we conjecture that the maximal
chains for different values of r converge at essentially the same rate. The
simulations strongly support the conjecture, which together with the data yields the
estimates shown in Table 1.

The outline of the paper is as follows. First we describe two approaches
to the design of experiments, and then we present experimental simulations
demonstrating our algorithms. For proofs and other technical details, the reader 
is referred to the technical report, 0. 
2 Location of Maximal Chains

Since a longest chain is only a very small subset of the set of n random points in
𝒱_d, we can expect to gain an advantage over the straightforward simulation
by restricting our search for such a chain to a small region where we are likely to
find the chain. Let 𝒱_d(r) be the region of 𝒱_d = [0, 1]^d obtained by translating
a cube of side r along the diagonal as shown in Figure 1. The volume V_d(r) of
𝒱_d(r) is given by V_d(r) = r^d + d r^(d−1) (1 − r). Let n points be sampled uniformly
in 𝒱_d(r) to form a random partially ordered set P_d(n, r). Denote the height of
this set by H_d(n, r) and define the random variable c_d(n, r) by

$$c_d(n, r) = \frac{H_d(n, r)}{(n / V_d(r))^{1/d}} \qquad (1)$$

Fig. 1. Diagonal volume element obtained by translation of a cube of side r



Intuitively, we expect that if one were to generate n* = ⌈n/V_d(r)⌉ points in the
whole cube 𝒱_d, then about n of them would fall in 𝒱_d(r). If n* is large, we
expect a maximal chain to exist in 𝒱_d(r). A formal statement to this effect can
be proved, namely that for every r > 0, c_d(n, r) converges to c_d. For more details,
the reader is referred to |H|. To compute c_d(n, r), there is no need to generate
points in the whole cube: we only need to generate points in this diagonal region
in a way that would be consistent with having generated a larger number of
points in the entire cube. Thus, for a fixed r > 0, a Monte Carlo simulation to
determine c_d(n, r) can be used to suggest a value for c_d, provided n is large.
One might hope that the experiment with a large and infeasible n* that yields
a good estimate for c_d would be equivalent to an experiment with a feasible
n ≈ n* V_d(r), for some r > 0. Thus, our general approach will be to estimate
E[c_d(n, r)] for various n and r by Monte Carlo simulation, and use these values
to construct a final estimate for c_d.
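
To make the relation between n, r and the number of virtual points n* concrete, the short Python sketch below evaluates the volume of the diagonal region, r^d + d r^(d−1)(1 − r), and the corresponding n*. The function names are ours and serve only as an illustration.

```python
import math

def diagonal_volume(d, r):
    """Volume of the region swept by a cube of side r along the diagonal."""
    return r ** d + d * r ** (d - 1) * (1.0 - r)

def virtual_points(n, d, r):
    """Number n* of points in the whole cube that corresponds to n points
    in the diagonal region, i.e. the ceiling of n divided by the volume."""
    return math.ceil(n / diagonal_volume(d, r))
```

For example, with d = 2 and r = 0.5 the region has area 0.75, so 1000 points in the region correspond to 1334 points in the unit square; with r = 1 the region is the whole cube and n* = n.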



3 Boosting 

Not every combination of n and r is computationally equivalent to an n* =
⌈n/V_d(r)⌉. In fact, for fixed n, as r → 0, c_d(n, r) approaches zero.
This simply indicates that r is too small for the given n, and it is
unlikely that a maximal chain of a random set with n* points in 𝒱_d is located
inside 𝒱_d(r). On the other hand, if n is fixed and r is sufficiently large, but less
than 1, then it is likely that about n out of n* points fall in 𝒱_d(r) and a maximal
chain exists within an even smaller distance of the diagonal than r. These
observations suggest that given n, there exists an optimal r = r_opt(n), which yields*
a maximal value for c_d(n, r_opt). In the next section, we present an algorithm for
computing c_d(n, r) which requires O(rnd) memory and runs in O(rn²d) time.
Thus, it is of practical importance to select as small an r as possible for which
c_d(n, r) approximates c_d. We are thus led to the following experiment.

Experiment 1 (Boosting):

1. Select a maximal feasible n.

2. Set r = 1, and obtain an estimate for c_d(n, r).

3. Decrease r (r ← r − ε, for a suitably chosen ε) and re-estimate c_d(n, r).
Repeat until c_d(n, r) stops increasing.

4. Determine r_opt(n), the r after which c_d(n, r) started to decrease.

5. If r_opt(n) n² d is much less than the available resources, increase n (n ←
n + n_0, for a suitably chosen n_0). Compute c_d(n, r). Go back to step 3.

6. Once a maximal n has been reached, and its associated optimal r, r_opt(n),
has been determined, compute c_d(n, r_opt) to the desired accuracy. Output
this value as the estimate for c_d.
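
Steps 2 to 4 of Experiment 1 can be sketched in Python as follows. This is only an illustration under our own assumptions: estimate is a hypothetical callback that returns a Monte Carlo estimate of c_d(n, r), and the stopping rule simply halts at the first r for which the estimate fails to increase, which ignores the sampling noise a real experiment would have to handle.

```python
def find_r_opt(estimate, n, eps=0.05, r_start=1.0):
    """Decrease r from r_start in steps of eps until estimate(n, r) stops
    increasing; return the best r found and the estimate obtained there."""
    best_r = r_start
    best_c = estimate(n, best_r)          # step 2: estimate at r = 1
    step = 1
    while r_start - step * eps > 0:
        r = r_start - step * eps          # step 3: r <- r - eps
        c = estimate(n, r)
        if c <= best_c:                   # estimate stopped increasing
            break
        best_r, best_c = r, c             # step 4: remember the best r so far
        step += 1
    return best_r, best_c
```

The outer loop of steps 5 and 6 (growing n while r_opt(n) n² d stays within the available resources) would call this routine repeatedly.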

As the name of Experiment 1 suggests, given a maximal n for the standard
dynamic programming algorithm (determined by computational resources), we
can boost it up to a higher n* by going to a smaller r_opt and, perhaps, further
increasing n as described in Experiment 1, while still using the same amount
of computational resources. Our results demonstrate that boosting gives
considerable improvement over the straightforward Monte Carlo. We can obtain
even faster convergence to c_d by combining our boosting algorithms with a
co-convergence conjecture. We discuss this next.

4 Co-convergence 

Another approach to estimating c_d is to select a set of values r_1, . . . , r_T for r
and consider T sequences {c_d(n, r_i)} for i = 1, . . . , T. Each of these sequences
converges to the same limit as n → ∞. The problem now is one of estimating
the common limit of these T sequences. Suppose c_d(n) (= c_d(n, 1)) converges at
some rate f(n) to c_d: c_d(n) = c_d − f(n). Based on geometrical considerations, it
is reasonable to expect that the convergence is also governed by f(n) and, thus,
we are led to conjecture that the order of convergence of c_d(n, r) for all r > 0
is essentially the same, i.e., there exists a function μ(r) > 0, such that

$$\forall r > 0, \quad c_d(n, r) = c_d - \mu(r) f(n) + o(f(n)) \qquad (2)$$

The goal of Experiment 1 was to obtain the smallest r, namely r_opt(n), for which
μ(r_opt) is a minimum. The conjecture suggests a strong interdependence between
the sequences, and we might be able to exploit this interdependence in order to


* In fact, our experiments suggest that c_d(n, r) has a single maximum as a function of
get a more accurate estimate of c_d. The traditional approach to obtaining the
convergence point of the sequence c_d(n) would be to assume that f(n) has a
certain form and then obtain a value for c_d consistent with this assumption and
with the observed values c_d(n). The success of this kind of approach depends
largely on the validity of the assumption on f(n). Our conjecture allows us to
estimate c_d without estimating f(n). Given r_i, r_j (r_i ≠ r_j), and n, (2) implies
that the following two equalities hold simultaneously:

$$\begin{cases} c_d(n, r_i) = c_d - \mu(r_i) f(n) + o(f(n)) \\ c_d(n, r_j) = c_d - \mu(r_j) f(n) + o(f(n)) \end{cases}$$

Eliminating f(n) from this system, we have

$$c_d(n, r_i) = (1 - A(r_i, r_j))\, c_d + A(r_i, r_j)\, c_d(n, r_j) + o(f(n)), \qquad (3)$$

where A(r_i, r_j) = μ(r_i)/μ(r_j). In other words, ∀i, j ∈ [1, T], c_d(n, r_i) linearly
depends on c_d(n, r_j), up to o(f(n)). Further, from the functions c_d(n, r_i) and
c_d(n, r_j), one can estimate A(r_i, r_j) from the slope of this dependence, and
(1 − A(r_i, r_j)) c_d from the intercept. Using these estimates, we can construct
an estimate for c_d without having to make any statements about f(n). From
(3), we see that the convergence to linear behavior is at a rate o(f(n)), whereas
the convergence of the sequences themselves is at the rate f(n). Thus, the linear
behavior will materialize at smaller n than the actual convergence. Hence, we
expect to extract more accurate estimates for c_d in this way, given the
computational resources. We are thus led to the following experiment.

Experiment 2 (Co-convergence):

1. Select a set T = {r_1, . . . , r_T} of values for r.

2. Select a set N = {n_1, . . . , n_L} of values for n. Let N_h = {n_1, . . . , n_h} for
h = 1, . . . , L.

3. Compute c_d(n, r), ∀r ∈ T and ∀n ∈ N.

4. ∀i, j ∈ [1, T] (i ≠ j), perform an analysis on the pair of sequences
{c_d(n, r_i)} and {c_d(n, r_j)} for n ∈ N_h to obtain the slope, A_d(h, i, j),
and the intercept, B_d(h, i, j), for h = 2, . . . , L.

5. Evaluate ĉ_d(h, i, j) = B_d(h, i, j)/(1 − A_d(h, i, j)) for h = 1, . . . , L.

/* ĉ_d(h, i, j) is a sequence of estimates for c_d; this sequence should
converge to c_d at the rate at which the linear behavior arises, i.e., o(f(n)). */

6. Compute e_d(i, j), the value to which ĉ_d(h, i, j) converges with respect to
h.

7. Repeat steps 4-6 to compute e_d(i, j) for all distinct pairs r_i, r_j ∈ T.

8. Let L_d = min_{i,j} e_d(i, j) and let U_d = max_{i,j} e_d(i, j).

9. Output the interval [L_d, U_d] as an estimate for c_d.
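
Steps 4 and 5 amount to fitting a line through the pairs (c_d(n, r_j), c_d(n, r_i)) and reading the estimate off the slope and intercept. The Python sketch below uses an ordinary least-squares fit, which is our assumption (the text only speaks of "an analysis"); the function names are hypothetical.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares line ys ~ slope * xs + intercept."""
    m = len(xs)
    mean_x = sum(xs) / m
    mean_y = sum(ys) / m
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def estimate_limit(c_i, c_j):
    """Estimate the common limit from two sequences c(n, r_i) and c(n, r_j)
    sampled at the same values of n: intercept / (1 - slope)."""
    slope, intercept = linear_fit(c_j, c_i)   # c_i as a linear function of c_j
    return intercept / (1.0 - slope)
```

On noise-free synthetic sequences that satisfy the linear relation exactly, the estimate recovers the limit regardless of the form of f(n); real Monte Carlo data would of course carry sampling error.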
5 Algorithms

Both experimental approaches described in the previous sections attempt to get
more accurate estimates for c_d by effectively “accessing” higher n without
actually computing with the higher n. Both techniques rely on efficient algorithms
for computing c_d(n, r).

The first task is to generate n points chosen independently from a uniform
distribution in 𝒱_d(r). A trivial algorithm is the one that generates random points
in 𝒱_d and keeps only those that fall in 𝒱_d(r), continuing until n points in 𝒱_d(r)
have been generated. This can be highly inefficient, especially if r is small, as
the acceptance rate will be extremely small. A more efficient approach is to
generate random points in 𝒱_d(r) itself. This can be done quite efficiently, and
further, it is possible to generate the points in a sequential manner so that the
dynamic programming algorithm for computing H_d(n, r) keeps in memory only
a small portion of the total of n points, those that are necessary for executing
the algorithm.
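
For comparison, the trivial rejection approach described above can be sketched in Python as follows. The membership test is our own derivation rather than something stated in the text: a point of the unit cube lies in some cube [t, t + r]^d with 0 ≤ t ≤ 1 − r exactly when its largest and smallest coordinates differ by at most r. The function names are hypothetical.

```python
import random

def in_diagonal_region(x, r):
    """True if x (a point of the unit cube) lies in the diagonal region."""
    return max(x) - min(x) <= r

def rejection_sample(n, d, r, rng=random):
    """Draw uniform points in the unit cube until n fall in the region.
    Returns the accepted points and the number of trials used."""
    points = []
    trials = 0
    while len(points) < n:
        trials += 1
        x = tuple(rng.random() for _ in range(d))
        if in_diagonal_region(x, r):
            points.append(x)
    return points, trials
```

The expected number of trials is n divided by the volume of the region, which is why this approach becomes impractical for small r.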

The second task is to compute H_d(n, r). The standard dynamic programming
algorithm has computational complexity O(dn²) [7]. However, since the points
are generated in a sequential manner, it is possible to design an algorithm that
maintains a working set that takes advantage of the sequential point generation.
Operations need only be performed on this working set, resulting in a factor of
r reduction in both computational complexity and memory requirements.

5.1 Generation of Input Points 

The generation of points is illustrated in Figure 2(a), for the case of d = 2. First,
an “origin” point, t(i) = (t(i), t(i), . . . , t(i)), is generated along the diagonal.
Then one of the hypercubes P_k(i), k = 1, . . . , d, of dimension d − 1 is chosen at
random and a point is generated from a uniform distribution on this hypercube.
In this way one can generate n points in 𝒱_d(r). The probability density of the
origin point coordinate t(i) is uniform up to t(i) = 1 − r and then decays like
(1 − t(i))^(d−1) for t(i) > 1 − r. This density is shown in Figure 2(b), for the case
d = 3 and r = 0.5. Further, instead of generating the origin points in a
random order, we generate the n order statistics for the origin points.

Algorithm 1 (Sequential generation of input points):

1. Set i = 1 and t_prev = 0.

2. Generate the origin point t(i) = (t(i), . . . , t(i)), where t(i) ∈
[t_prev, 1] is the i-th order statistic (given that the (i − 1)-th order statistic
was at t_prev) of n points generated from the distribution of origin points
such as the one shown in Figure 2(b).

3. Generate a random vector v = {v_1, v_2, . . . , v_d} where each component is in
[0, min{r, 1 − t(i)}].

4. Generate a random integer k from {1, 2, . . . , d} and set v_k = 0.

5. Generate input point x(i) = {t(i) + v_1, t(i) + v_2, . . . , t(i) + v_d}.

6. Set t_prev = t(i), i = i + 1 and go back to step 2 if i ≤ n.
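
The steps of Algorithm 1 can be sketched in Python as follows. One important simplification is ours: instead of generating the order statistics one at a time (the text defers that procedure), the sketch draws all n origin coordinates and sorts them, which gives the same ordering but keeps all origins in memory. The sampling of the origin coordinate assumes a density that is flat on [0, 1 − r] and proportional to (1 − t)^(d−1) beyond it; the function name is hypothetical.

```python
import random

def generate_points(n, d, r, rng=random):
    """Yield (t, x) pairs with origin coordinates t in nondecreasing order."""
    volume = r ** d + d * r ** (d - 1) * (1.0 - r)
    p_flat = d * r ** (d - 1) * (1.0 - r) / volume   # mass of the flat part
    origins = []
    for _ in range(n):
        if rng.random() < p_flat:
            t = rng.uniform(0.0, 1.0 - r)            # flat part of the density
        else:
            # tail: 1 - t has density proportional to s^(d-1) on [0, r]
            t = 1.0 - r * rng.random() ** (1.0 / d)
        origins.append(t)
    origins.sort()                                   # assumption: sort, not stream
    for t in origins:
        side = min(r, 1.0 - t)                       # step 3
        v = [rng.uniform(0.0, side) for _ in range(d)]
        v[rng.randrange(d)] = 0.0                    # step 4
        yield t, tuple(t + vi for vi in v)           # step 5
```

Each generated point has its smallest coordinate equal to t, so the origin is the projection of the point onto the diagonal along the chosen face.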
Fig. 2. (a) The sequential generation of points in 𝒱_d(r) illustrated for the case d = 2.
(b) Probability density of the origin point coordinate t for d = 3, r = 0.5



The detailed implementation of step 2 to generate the order statistics for the
origin points will be presented in the final paper. The next proposition states two
(almost trivial) ordering properties that the points x(1), . . . , x(n) have. While
these properties might seem trivial, they are of key importance to the algorithm
for computing the height of the n points in 𝒱_d(r), as we shall see in the next
section.

Proposition 1 (Ordering Properties). Suppose points x(1), . . . , x(n) are
generated according to the algorithm above. Then,

1. If t(i) > x(j), then, for all k > i, x(k) > x(j).

2. If i < j, it cannot be that x(i) > x(j); more specifically, P[x(i) > x(j)] = 0.

5.2 Computing H_d(n, r)

Although the ordering of the origin points does not guarantee that the
projected input points {x(1), x(2), ..., x(n)} are ordered, it does guarantee the
ordering properties stated in Proposition 1. Property 2 ensures a newly
generated point can never be below a previously generated point. Therefore, the
height of newly generated points can be immediately computed without
considering future points. At iteration i, property 1 ensures x(i) has a height of at
least h(x(j)) + 1 for all x(j) < t(i). Among these input points, the x(j) with
maximum height is identified as the barrier point and the rest are discarded. The
remaining input points (all x(j) not below t(i), plus the barrier point) define the working
set for iteration i + 1. The height of x(i + 1) can be determined by inspecting
only the points in the working set.

H_d(n, r) is computed using the following procedure, where t and x are the
origin point and projected input point as generated by the sequential generator
described above, h(x) is the height of the maximal chain ending at point x, and
x_barrier is the barrier point.

Algorithm 2 (Computation of H_d(n, r)):

1. Set h_max = 1.

2. Using Algorithm 1, generate t(1) and x(1) and add x(1) to the working
set.

3. Set the height of x(1) to 1 (h(x(1)) = 1).

4. For i = 2 to n

(a) Using Algorithm 1, generate t(i) and x(i).

(b) Initialize the height of x(i) to 1 (h(x(i)) = 1).

(c) For every point x(j) in the working set

i. If x(i) is above x(j) and h(x(j)) >= h(x(i)) then set h(x(i)) =
h(x(j)) + 1.

ii. If x(j) is below t(i), if the barrier point is defined, and if h(x(j)) >
h(x_barrier), then remove the current barrier point and replace it
with x(j).

iii. If h(x(j)) <= h(x_barrier) then remove x(j) from the working set.

iv. If x(j) is below t(i) and the barrier point is not defined then set
x(j) to be the barrier point.

(d) Add x(i) to the working set.

(e) If h(x(i)) > h_max then set h_max = h(x(i)).

5. Return h_max.
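
The working-set idea behind Algorithm 2 can be sketched in Python as follows. This is a simplified restatement rather than a line-by-line transcription: after computing the height of the new point, it collapses all working-set points that lie strictly below the current origin into the single highest one (the barrier). It assumes the input is a sequence of (t, x) pairs with nondecreasing t and min(x) = t, as produced by a generator in the style of Algorithm 1; the function names are ours.

```python
def dominates(a, b):
    """True if a is above b, i.e. every coordinate of a exceeds that of b."""
    return all(ai > bi for ai, bi in zip(a, b))

def chain_height(stream):
    """Length of a longest chain among the points x of the (t, x) pairs."""
    working = []                      # list of (point, height) pairs
    h_max = 0
    for t, x in stream:
        h = 1                         # every point is a chain of height 1
        for p, hp in working:
            if dominates(x, p) and hp + 1 > h:
                h = hp + 1
        # Points entirely below the origin t are below all future points,
        # so only the highest of them (the barrier point) needs to be kept.
        below = [(p, hp) for p, hp in working if max(p) < t]
        if below:
            barrier = max(below, key=lambda entry: entry[1])
            working = [(p, hp) for p, hp in working if not max(p) < t]
            working.append(barrier)
        working.append((x, h))
        h_max = max(h_max, h)
    return h_max
```

On inputs satisfying the stated assumptions, the result agrees with the quadratic dynamic program over all points, while only the working set is scanned for each new point.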



[Figure panels: iteration i; iteration i+1 working set. Key: discarded input point; working set input point; origin point.]

Fig. 3. The working set at iteration i and i + 1



The algorithm computes the height of each x(i) as the input points are
generated. x(i) is given a default height of 1. Every input point forms a chain
of height 1 with itself as the beginning and end points. The new input point
is compared with every x(j) in the working set to find the highest chain which
can be continued by x(i). Each x(j) is also compared with t(i) to determine if
it is below the barrier. If x(j) is below the barrier, it either is removed from
the working set or (in the case where h(x(j)) > h(x_barrier)) replaces the current
barrier point. After iteration i, the working set includes only the barrier point
and the points that are not below the barrier t(i) (see Figure 3). This defines
a subset of 𝒱_d(r) which has a volume V_d(r) r. Since n points are uniformly
distributed in 𝒱_d(r), it is expected that the working set contains rn points.
Since the algorithm must iterate over the working set for every newly generated
point, the expected number of point comparisons is O(rn²). Therefore, using a
very small r not only increases the number of virtual points n* but decreases
the memory requirements and the number of computations by a factor of r. The
proof that this algorithm is correct can be found in the technical report.

6 Experiments 

We have performed computer simulations to support the claims that boosting
provides a considerable efficiency gain over straightforward Monte Carlo, and
that the co-convergence experiment coupled with the algorithms for boosting
leads to rapidly converging estimates for {c_d(n, r)}. We present results for the
case d = 3 here. We have also performed experiments for other dimensions,
d = 2, 4, 5, and 6, with similar results.

Our first series of simulations used the following values for n and r
to compute c_d(n, r).

n ∈ {50000, 100000, 200000, 300000}

r ∈ {0.01, 0.02, 0.06, 0.08, 0.10, 0.14, 0.20, 0.30, 1.00}

The dependence of c_d(n, r) on r for various n is shown in Figure 4. The larger
the estimate, the closer it is to the true value. It appears that the radius that
maximizes c_d(n, r) is r_opt ≈ 0.08. With r = 1, to obtain an estimate comparable
to using r_opt and n = 50000, one needs n > 300000, which represents a gain in
efficiency by a factor of over 450 in favor of boosting (the running time is
proportional to rn², and (1 · 300000²)/(0.08 · 50000²) = 450). Further, Figure 4 shows
that the maximum value of c_d(n, r) is achieved using roughly the same value of r
for 4 different values of n. The data indicate that r_opt is independent of n, which
supports the assumption that μ(r) is independent of n, as required by the conjecture.

Our second set of simulations tests co-convergence (according to Experiment
2). We selected n and r according to the following scheme.

$$n_k = \lfloor 10^{2 + k \times 0.1} \rfloor \ \text{for } k \in [0, 40], \quad r \in \{0.01, 0.005, 0.001, 0.0005, 0.0001\} \qquad (4)$$

Figure 5(a) demonstrates the convergence behavior of c_3(n, r), for the five
values of r. Even for n = 10^6, it is not clear to what value these sequences
are converging. Given c_d(n, r_i) and c_d(n, r_j) for n = n_1, . . . , n_h, we can estimate
A_d(h, i, j) and B_d(h, i, j) for h = 2, . . . , L as in Experiment 2. The estimate given
by B(h, i, j)/(1 − A(h, i, j)) should converge with respect to h to c_d, independent
of which particular pair i, j is used. The behavior of our estimate for every
pair of radii is demonstrated in Figure 5(b). It is clear that the curves are
all converging to the same value. Further, as expected, the convergence occurs
earlier, as compared with the convergence of c_3(n, r) in Figure 5(a), in accordance
with the expected o(f(n)) behavior of this estimate. We propose a range for c_3 by
taking the range of values to which these curves are converging. Similar results
have been obtained for the cases d = 2, 4, 5, 6, and the estimates for c_d were
presented earlier.

Fig. 4. Dependence of c_d(n, r) on r for various n’s.

Fig. 5. (a) Convergence of c_3(n, r). (b) Convergence of the estimates ĉ_3(h, i, j) and e_3(i, j)
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Abstract. In nearest neighbor searching we are given a set of n data 
points in real d-dimensional space, and the problem is to preprocess 
these points into a data structure, so that given a query point, the nearest 
data point to the query point can be reported efficiently. Because data 
sets can be quite large, we are interested in data structures that use 
optimal O(dn) storage. 

In this paper we consider a novel approach to nearest neighbor search- 
ing, in which the search returns the correct nearest neighbor with a given 
probability assuming that the queries are drawn from some known dis- 
tribution. The query distribution is represented by providing a set of 
training query points at preprocessing time. 

The data structure, called the overlapped split tree, is an augmented 
BSP-tree in which each node is associated with a cover region, which 
is used to determine whether the search should visit this node. We use 
principal component analysis and support vector machines to analyze 
the structure of the data and training points in order to better adapt 
the tree structure to the data sets. We show empirically that this new 
approach provides improved predictability over the kd-tree in average 
query performance. 



1 Introduction 

Nearest neighbor searching is a fundamental problem in the design of geometric 
data structures. Given a set S of n data points in some metric space, we wish to 
preprocess S into a data structure so that given any query point q, the nearest 
point to q in S can be reported quickly. Nearest neighbor searching has 
applications in many areas, including knowledge discovery and data mining, 
pattern recognition and classification, machine learning, data compression, 
multimedia databases, document retrieval, and statistics. 
Many data structures have been proposed for nearest neighbor searching. 
Because many applications involve large data sets, we are interested in data 
structures that use only linear or nearly linear storage. Throughout we will 
assume that the space is real d-dimensional space and the metric is any 
Minkowski metric. For concreteness we consider Euclidean distance. 

Naively, nearest neighbor queries can be answered in O(dn) time. The search 
for more efficient data structures began with the seminal work by Friedman, 
Bentley, and Finkel, who showed that such queries could be answered 
efficiently in fixed dimensions through the use of kd-trees. Since then many different 
data structures have been proposed from the fields of computational geometry 
and databases. These include the R-tree and its variants, the X-tree, the 
SR-tree, the TV-tree, and the BAR tree, not to mention numerous 
approaches based on Voronoi diagrams and point location. 

Although nearest neighbor searching can be performed efficiently in 
low-dimensional spaces, for all known linear-space data structures, search times grow 
exponentially as a function of dimension. Even in moderately large dimensional 
spaces, in the worst case, a large fraction of points in S are visited in the search. 
One explanation for this phenomenon is that in high dimensional space, the 
distribution of the inter-point distances tends to concentrate around the mean 
value. As a consequence, it is difficult for the search algorithm to eliminate points 
as being candidates for the nearest neighbor. Fortunately, in many of the data 
sets that arise in real applications, correlations between dimensions are common, 
and as a consequence the points tend to cluster in lower dimensional subspaces. 
Good search algorithms take advantage of this low-dimensional clustering 
to reduce search times. 

A popular approach to reducing the search time is through approximate 
nearest neighbor search. Given ε > 0, the search may return any point p in S whose 
distance to query q is within a factor of (1 + ε) of the true nearest distance. 
Approximate nearest neighbor search provides a trade-off between speed and 
accuracy. Algorithms and data structures for approximate nearest neighbor searching 
have been given by Bern, Arya and Mount, Arya et al. [2], Clarkson, 
Chan, Kleinberg, Indyk and Motwani, and Kushilevitz, Ostrovsky, 
and Rabani. 
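
The (1 + ε) condition can be made concrete with a small sketch. It uses a brute-force O(dn) scan as the reference answer; the function names and the sample points are illustrative assumptions and do not come from the ANN library.

```python
import math

def nearest_neighbor(points, q):
    """Brute-force scan: O(dn) time for n points in d dimensions."""
    return min(points, key=lambda p: math.dist(p, q))

def is_approximate_nn(points, q, p, eps):
    """True if p lies within a factor (1 + eps) of the true nearest distance."""
    true_dist = math.dist(nearest_neighbor(points, q), q)
    return math.dist(p, q) <= (1.0 + eps) * true_dist

# Hypothetical toy data: the true nearest neighbor of q is (0, 0) at distance 1.
S = [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0)]
q = (1.0, 0.0)
accepted = is_approximate_nn(S, q, (3.0, 0.0), eps=1.0)   # distance 2 <= 2 * 1
rejected = is_approximate_nn(S, q, (3.0, 0.0), eps=0.5)   # distance 2 >  1.5 * 1
```

With ε = 1 the point at distance 2 is an acceptable answer even though it is twice as far away as the true nearest neighbor, which is the kind of slack that large values of ε permit.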

In our experience with the ANN library for approximate nearest neighbor 
searching, we have observed two important phenomena. The first is that 
significant improvements in search times often require uncomfortably large values 
of ε. The second is that the actual average error committed in the search is 
typically much smaller (by factors of 10 to 30) than ε. The combination of these 
two effects results in an undesirable lack of predictability in the performance of 
the data structure. In order to achieve greater efficiency, users often run the 
algorithm with a high ε, and sacrifice assurances of accuracy on each query for 
better speed and the hope of good average case performance. One of our 
motivations in this research is to find data structures that provide good efficiency 
but with a higher degree of predictability. 

In this paper we consider an alternative approach for dealing with this short- 
coming. We propose a new data structure for nearest neighbor searching, which 
applies a different kind of approximation. In many applications of nearest neigh- 
bor searching, the performance of any one query is not as important as the 
aggregate results. One example is image compression based on vector 
quantization. An error committed on a single pixel may not seriously impact 
the overall image quality. Our approach may be described as a probably-correct 
nearest neighbor search. Assuming that the queries are drawn from some known 
distribution, the search returns the true nearest neighbor with a probability that 
the user can adjust. The query distribution is described by providing a set of 
training queries as part of the preprocessing stage. By analyzing the training data we 
can better adapt the data structure to the distributional structure of the queries. 
The idea of allowing occasional failures was considered earlier by Ciaccia and 
Patella in a more limited setting. 

We introduce a new data structure for this problem called an overlapped 
split tree or os-tree for short. The tree is a generalization of the well known 
BSP-tree, which uses the concept of a covering region to control which nodes are 
searched. We will introduce this data structure in the next section, and discuss 
how it can be applied for both exact and probably-correct nearest neighbor 
searching. In the subsequent section we provide experimental evidence for the 
efficacy and efficiency of this data structure. We show empirically that it provides 
an enhanced level of predictability in average query performance over the kd-tree 
data structure. 

2 Overlapped Split Tree 

The os-tree is a generalization of the well-known binary space partition (BSP) 
tree. Consider a set S of points in d-dimensional space. A BSP-tree 
for this set of points is based on a hierarchical subdivision of space into convex 
polyhedral cells. Each node in the BSP-tree is associated with such a cell and the 
subset of points lying within this cell. The root node is associated with the entire 
space and the entire set. A cell is split into two disjoint cells by a hyperplane, 
and these two cells are then associated with the children of the current node. 
Points are distributed among the children according to the cell in which they are 
contained. The process is repeated until the number of points associated with a 
node falls below a given threshold. Leaf nodes store these associated points. 

Many of the data structures used in nearest neighbor searching are special 
cases or generalizations of the BSP-tree, but we will augment this data structure 
with an additional element. In addition to its BSP-tree structure, each node of 
the os-tree is associated with an additional convex polyhedral cell called a cover. 
Intuitively, the cover associated with each node contains every point in space 
whose nearest neighbor is a data point in the associated cell. That is, the 
cover contains the union of the Voronoi cells of the points associated with the 
cell. (Later we will need to relax this condition, for purposes of computational 
efficiency.) The covers of the children of a node will generally overlap each other. 

Fig. 1 shows an example of a parent and its two children in the plane. In each 
case the cell is shaded and the cover is the surrounding polygon. In typical BSP 
fashion, the parent cell is split by a line, which partitions the point set and cell. 
The Voronoi bisector between the two subsets of points is shown as a broken 
Fig. 1. Overlapping covers 



line. Observe that the covers for the left and right children are large enough to 
enclose the Voronoi bisector, and hence all the points of the space whose nearest 
neighbor lies within the corresponding side of the partition. 

Let us describe the os-tree more formally. The data point set is denoted by 
S. Each node of the tree is denoted by a string of l's and r's. The root node 
is labeled with the empty string, φ. Given a node δ, its left child is δl, and its 
right child is δr. (The terms left and right are used for convenience, and do not 
connote any particular orientation in space.) Given a node δ, we use c_δ to denote 
its cell, C_δ to denote its cover, and S_δ to denote its subset of points. 

The entities c_δ and S_δ together form a BSP-tree for the point set. The cover 
of a node δ is a convex polyhedron that has the following properties: 

— The cover of the root node φ is the entire space. 

— Let V(p) denote the Voronoi cell of point p in S_δ, and let V_δ be the union 
of V(p) over all p in S_δ. Then 

\[ C_\delta \supseteq V_\delta. \]

— Let H_δ be a set of (d − 1)-dimensional hyperplanes that bound C_δ. There 
are two parallel hyperplanes L and R such that 

\[ H_{\delta l} \subseteq H_\delta \cup \{L\}, \qquad H_{\delta r} \subseteq H_\delta \cup \{R\}. \]

(Note that these hyperplanes need not be parallel to the BSP splitting 
hyperplane.) 

Each internal node of the tree stores the coefficients of hyperplanes L and R. 
The selection of these hyperplanes will be discussed later. 

To answer the nearest neighbor query q, the search starts at the root. At 
any internal node δ, it determines if the query lies within either cover C_δl or 
C_δr (it may be in both). For each cover containing q, the associated child is 
visited recursively. From the definition of the cover, if the query point does not 
lie within the cover then the subtree rooted at this node cannot possibly contain 
the nearest neighbor. When the search reaches a leaf node, distances from the 
query to all associated points in the leaf are computed. When all necessary nodes 
have been visited, the search terminates and returns the smallest distance seen 
and the associated nearest neighbor. It is easy to see that this search returns 
the correct result. Observe that the efficiency of the search is related to how 
“tightly” the cover surrounds the union of the Voronoi cells for the points, since 
looser covers require that more nodes be visited. 
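
The cover-guided descent described above can be sketched as follows. This is a minimal illustration under stated assumptions: each cover is represented as a list of halfspaces (normal, offset) with the convention normal . x >= offset, and the `Node` class and the one-dimensional example tree are hypothetical, not taken from the authors' implementation.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    cover: list                                  # halfspaces (normal, offset)
    points: list = field(default_factory=list)   # data points (leaves only)
    left: "Node" = None
    right: "Node" = None

def in_cover(cover, q):
    """q lies in the cover if it satisfies every bounding halfspace."""
    return all(sum(a * x for a, x in zip(normal, q)) >= offset
               for normal, offset in cover)

def search(node, q, best=(math.inf, None)):
    """Visit every child whose cover contains q; return (distance, point)."""
    if node.left is None and node.right is None:
        for p in node.points:
            d = math.dist(p, q)
            if d < best[0]:
                best = (d, p)
        return best
    for child in (node.left, node.right):
        if child is not None and in_cover(child.cover, q):
            best = search(child, q, best)
    return best

# Hypothetical 1-D tree: left cover is x <= 2.5, right cover is x >= 1.5,
# so the covers overlap on [1.5, 2.5] around the Voronoi bisector at x = 2.
root = Node(cover=[],
            left=Node(cover=[((-1.0,), -2.5)], points=[(0.0,), (1.0,)]),
            right=Node(cover=[((1.0,), 1.5)], points=[(3.0,), (4.0,)]))
```

A query at 1.2 descends only into the left child, while a query at 2.2 lies in the overlap and visits both children, which is why looser covers cost more node visits.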



2.1 Probably Correct Os-tree 

One of the difficulties in constructing the os-tree as outlined in the previous 
section is that it would seem to require knowledge of the Voronoi diagram. The 
combinatorial complexity of the diagram grows exponentially with dimension, 
and so explicitly computing the diagram for high dimensional nearest neigh- 
bor searching is impractical. In order to produce a more practical construction, 
we first introduce a variant of the os-tree in which queries are only answered 
correctly with some probability. 

Let us assume for now that the probability density function of query points 
is known. (This is a rather bold assumption, but later we will see that through 
the use of training data we can approximate what we need to know about the 
distribution reasonably accurately.) In particular, given a region X ⊂ R^d, let 
Q(X) be the probability that a query lies within X. Given real f, 0 < f < 1, 
let C(f, Q)_δ denote any convex polyhedron such that 

\[ \frac{Q(V_\delta \setminus C(f,Q)_\delta)}{Q(V_\delta)} \le f. \]

In other words, the region C(f, Q)_δ contains all but at most a fraction f of the 
probability mass of V_δ. 

The os-tree has the property that the cover of each node contains the covers of 
its children. Given the probability density Q and a search algorithm, the failure 
probability of the algorithm is the probability (relative to Q) that this algorithm 
fails to return the correct nearest neighbor. The following is easy to prove. 

Lemma 1. Given queries drawn from Q, if the cover for each leaf δ of an os-tree 
satisfies the condition C(f, Q)_δ above, then the failure probability for the os-tree 
search is at most f. 

The lemma provides sufficient conditions on the covers of the os-tree to guar- 
antee a particular failure probability. In the next section, we discuss the efficient 
construction of the os-tree in greater detail. 
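
When Q is available only through sampled queries, the fraction of the probability mass of V_δ that a candidate cover misses can be estimated empirically. The sketch below assumes that each sampled query has already been labeled with its true nearest neighbor; the function name, the two predicates, and the toy data are hypothetical.

```python
def uncovered_fraction(training, in_subset, in_cover):
    """Estimate Q(V minus C) / Q(V) from labeled sample queries.

    training  -- list of (query, nearest_neighbor) pairs sampled from Q
    in_subset -- predicate: does this data point belong to the node's subset?
    in_cover  -- predicate: does this query lie inside the candidate cover?
    """
    # Queries whose nearest neighbor is in the subset are samples from V.
    inside_voronoi = [q for q, nn in training if in_subset(nn)]
    if not inside_voronoi:
        return 0.0
    missed = sum(1 for q in inside_voronoi if not in_cover(q))
    return missed / len(inside_voronoi)

# Hypothetical 1-D example: the node holds data points 0 and 1,
# and the candidate cover is the halfline x <= 1.9.
training = [(0.2, 0.0), (0.9, 1.0), (1.8, 1.0), (1.95, 1.0), (2.5, 3.0)]
fraction = uncovered_fraction(training,
                              in_subset=lambda p: p in (0.0, 1.0),
                              in_cover=lambda q: q <= 1.9)
```

Here four sampled queries fall in the union of Voronoi cells and one of them (1.95) lies outside the cover, so the estimate is 0.25; a cover with an estimate of at most f is a candidate for the condition in Lemma 1, up to sampling error.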



2.2 Building the Os-tree 

We do not know of a practical way to construct the ideal os-tree that we have out- 
lined earlier. This is because complex query distributions are not generally easily 
representable, and the shape of Voronoi cells of a high dimensional point set can 
be highly complex. Finding a convex polyhedron that covers a desired fraction 
of the density of a set of Voronoi cells seems to require intensive computation. 
We incorporate the following changes to the os-tree in our implementation: 
— Instead of using an abstract query distribution Q, we use a set T of training 
query points, which are assumed to be randomly sampled from Q. Each 
training point will be labeled according to its nearest neighbor in S, that is, 
the Voronoi cell containing it. 

— In the search process, we add the distance comparison (described below) as 
another criterion for pruning nodes from being visited. 

Although the assumption of the existence of the training set is a limitation 
to the applicability of our approach, it is not an unreasonable one. In most 
applications of nearest neighbor searching there will be many queries. One mode 
in which to apply this algorithm is to sample many queries over time, and update 
the data structure periodically based on the recent history of queries. 

Given the data points S and training points T, the construction process starts 
by computing the nearest neighbor in S for each point in T. The construction 
then starts with the root node, whose cell and cover are the entire space. In 
general, for node δ, there will be associated subsets S_δ and T_δ. If the number of 
points in S_δ is smaller than a predefined threshold, then these points are stored 
in a leaf cell. Otherwise c_δ is split into two new cells by a separating hyperplane, 
H_δ, which partitions S_δ into subsets of roughly equal sizes. (The computation 
of H_δ is described below.) This hyperplane may have an arbitrary orientation. 
We partition the points among the two resulting child nodes. 

We then use the training set T_δ to compute the cover of the node. Partition 
T_δ into T_δl and T_δr depending on whether each point's nearest neighbor is in 
S_δl or S_δr, respectively. We then compute two (signed) hyperplanes, B_δl and B_δr, 
called the boundary hyperplanes, by a method described below. These two hyperplanes 
are parallel to each other but have oppositely directed normal vectors. Define 
B_δl so that it bounds the smallest halfspace enclosing T_δl. This can be done by 
computing the dot product between each point in T_δl and B_δl's normal, and 
taking the minimum value. Do the same for B_δr. These two hyperplanes are 
then stored with the resulting node. Define C_δl to be C_δ ∩ B^+_δl, where B^+_δl is the 
halfspace bounded by B_δl, and do similarly for C_δr. The process is then applied 
recursively to the children. 
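
The placement of the two boundary hyperplanes reduces to a minimum over dot products. The following sketch assumes a halfspace convention of normal . x >= offset and takes the normal direction as given; the function names and the sample training points are illustrative only.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def boundary_offset(normal, training_points):
    """Offset of the smallest halfspace {x : normal . x >= offset}
    that encloses all of the given training points."""
    return min(dot(normal, t) for t in training_points)

def boundary_hyperplanes(normal_left, T_left, T_right):
    """Two parallel hyperplanes with oppositely directed normals."""
    normal_right = tuple(-a for a in normal_left)
    return ((normal_left, boundary_offset(normal_left, T_left)),
            (normal_right, boundary_offset(normal_right, T_right)))

# Hypothetical 2-D example: the left normal points toward negative x.
T_left = [(0.5, 1.0), (1.2, 0.0), (2.3, 5.0)]
T_right = [(1.7, 0.0), (3.0, 1.0)]
B_left, B_right = boundary_hyperplanes((-1.0, 0.0), T_left, T_right)
```

In this example the left halfspace is x <= 2.3 and the right halfspace is x >= 1.7, so training queries with 1.7 <= x <= 2.3 fall in the overlap region where both children may be visited.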

To answer the query, the search starts at the root node. At any internal node, 
we find the location of the query point relative to the separating hyperplane H_δ 
and the two boundary hyperplanes B_δl and B_δr. (This can be done in O(d) 
time.) For concreteness, suppose that we are on the left side of the separating 
hyperplane. (The other case is symmetrical.) We visit the left child first. If the 
query point does not lie within the right bounding halfspace, B^+_δr (that is, it is not 
in the overlap region), then the right child will not be visited. Even if the query 
is in the overlap region, if the current nearest neighbor distance is less than 
the distance from the query point to the separating hyperplane, then the right 
child is not visited (since no point on the other side of H_δ, and hence no point 
in S_δr, can be closer). Otherwise the right child is visited. When we reach a leaf 
node, distances to the data points are computed and the smallest distance is 
updated if needed. 
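
The search with both pruning rules can be sketched as below. The sketch makes a simplifying assumption that the boundary hyperplanes share the unit normal of the separating hyperplane, so that all three are described by scalar offsets along one direction; the class names, field names, and the toy tree are hypothetical.

```python
import math
from dataclasses import dataclass

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

@dataclass
class Leaf:
    points: list

@dataclass
class Internal:
    normal: tuple        # unit normal of the separating hyperplane
    offset: float        # separating hyperplane: normal . x = offset
    left_bound: float    # left cover:  normal . x <= left_bound
    right_bound: float   # right cover: normal . x >= right_bound
    left: object
    right: object

def search(node, q, best=(math.inf, None)):
    """Return (distance, point) using cover and distance pruning."""
    if isinstance(node, Leaf):
        for p in node.points:
            d = math.dist(p, q)
            if d < best[0]:
                best = (d, p)
        return best
    s = dot(node.normal, q)
    on_left = s <= node.offset
    near, far = (node.left, node.right) if on_left else (node.right, node.left)
    best = search(near, q, best)              # visit the query's own side first
    in_far_cover = s >= node.right_bound if on_left else s <= node.left_bound
    gap = abs(s - node.offset)                # distance to separating hyperplane
    if in_far_cover and best[0] >= gap:       # skip if cover or distance prunes
        best = search(far, q, best)
    return best

# Hypothetical 1-D tree with separator at 2 and overlap region [1.2, 2.8].
root = Internal(normal=(1.0,), offset=2.0, left_bound=2.8, right_bound=1.2,
                left=Leaf([(0.0,), (1.0,)]), right=Leaf([(3.0,), (4.0,)]))
```

For the query 1.3 the point lies in the overlap region, but the current best distance (0.3) is smaller than the distance to the separator (0.7), so the right child is skipped by the distance test alone.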
2.3 The Os-tree Splitting Rule 

The only issues that are left to be explained are the choices of the splitting 
hyperplane and the boundary hyperplanes. We impose the requirement that roughly 
half of the points in the cell lie on each side of the splitting hyperplane. 
This condition guarantees a balanced tree. An obvious criterion for selecting the 
orientation of the hyperplane is to subdivide the points orthogonal to the 
direction of greatest variation. To do this we use the well-known principal component 
analysis (PCA) method. We compute the covariance matrix for the data 
points S_δ, and sort the points according to their projection onto the eigenvector 
corresponding to the largest eigenvalue. The smaller half of the resulting sequence 
is placed in S_δl and the larger half in S_δr. The use of PCA in choosing 
splitting planes has been considered before. 
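
The PCA-based partition can be illustrated with a dependency-free sketch. It swaps in power iteration to find the dominant eigenvector, which is an assumption of this example rather than the PCA program used in the implementation; the function names and the sample points are hypothetical.

```python
import random

def principal_direction(points, iterations=200):
    """Dominant eigenvector of the covariance matrix, by power iteration."""
    n, d = len(points), len(points[0])
    mean = [sum(p[k] for p in points) / n for k in range(d)]
    centered = [[p[k] - mean[k] for k in range(d)] for p in points]
    cov = [[sum(c[i] * c[j] for c in centered) / n for j in range(d)]
           for i in range(d)]
    rng = random.Random(0)
    v = [rng.random() + 0.1 for _ in range(d)]
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0.0:          # all points coincide: any direction will do
            break
        v = [x / norm for x in w]
    return v

def pca_split(points):
    """Sort by projection onto the principal direction and split at the median."""
    v = principal_direction(points)
    ordered = sorted(points, key=lambda p: sum(a * b for a, b in zip(v, p)))
    half = len(ordered) // 2
    return ordered[:half], ordered[half:], v

# Hypothetical data spread along the x-axis with small noise in y.
pts = [(0.0, 0.1), (1.0, -0.1), (2.0, 0.05), (3.0, -0.05), (4.0, 0.0), (5.0, 0.02)]
lower, upper, direction = pca_split(pts)
```

Because the variation is almost entirely along x, the principal direction is close to the x-axis and the two halves separate the points with small x from those with large x.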

It is clear that if the maximum of the projections of the l-labeled points is lower 
than the minimum of the projections of the r-labeled points, then there are infinitely 
many hyperplanes that can be chosen as the separating hyperplane. The 
optimal hyperplane (with respect to the search time) is the one that minimizes the 
number of training points in the final overlap region. The more training points 
in the overlap region, the more likely that both children of the current node will 
be visited. We use the support vector machine (SVM) method [35] to choose the 
separating hyperplane. The SVM method does not necessarily give us the optimal 
hyperplane, but it finds a hyperplane that is good enough in the sense that it is 
the one that is furthest from the closest data points. 

The SVM method was developed in the area of learning theory [23] to find 
the hyperplane that best separates two classes of points in multidimensional 
space. A more detailed description of SVM can be found in the literature. In this paper, we 
will review the basic linear SVM method using the hyperplane as the separator 
in the original space. 

By construction the two data point sets are linearly separable. In this case 
SVM finds a separating hyperplane with the highest margin, where margin is 
defined as the sum of the distance from the hyperplane to the points in S_δl 
and the distance to the points in S_δr. SVM formulates this problem as a 
nonlinear optimization problem. Solving the optimization problem returns a set of 
points called support vectors as well as other coefficients. These values are used 
to determine the separating hyperplane. The result of SVM is the separating 
hyperplane H_δ of the node. Now, the training points associated with the node 
are used to determine the final placement of the boundary hyperplanes B_δl and 
B_δr, as described earlier. 
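
For reference, the maximum-margin problem for linearly separable data is commonly written as the quadratic program below. This is the standard textbook hard-margin formulation and is offered only as a sketch; the labels y_i, the weight vector w, the bias b, and the count m are notation introduced here, not symbols from the text above.

```latex
% Labels: y_i = -1 for x_i in S_{\delta l}, and y_i = +1 for x_i in S_{\delta r}.
\[
\min_{w,\,b}\ \tfrac{1}{2}\,\|w\|^{2}
\quad\text{subject to}\quad
y_i\,(w \cdot x_i + b) \ge 1, \qquad i = 1, \dots, m .
\]
% The closest points on each side lie at distance 1/\|w\| from the hyperplane
% w \cdot x + b = 0, so the margin (the sum of the two distances) is 2/\|w\|.
```

Minimizing the norm of w therefore maximizes the margin, and the points that satisfy a constraint with equality are the support vectors.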

2.4 Implementation 

We modified the ANN library [21] to accommodate the os-tree. This is a library 
for approximate nearest neighbor search using the kd-tree and bbd-tree [3] as the 
underlying data structures. We use an existing PCA program and the LIBSVM 
library as the SVM module. LIBSVM is a simple and fast implementation of SVM. 
It uses a very simple optimization subroutine to find the optimal separating 
hyperplane. Like most other SVM packages, the result of LIBSVM may not be 
the optimal hyperplane. This is due to numerical errors and the penalty cost. 
Based on our experiments, if we use a higher value of penalty cost, it usually 
produces a better separating hyperplane, but it requires more CPU time. In all 
of our experiments, we set the penalty cost to 1000. 

Even though LIBSVM is relatively fast, it was the main bottleneck in the 
construction time of the data structure. Its running time is superlinear in the 
number of input points. In the implementation, instead of providing SVM with 
the entire set of data points in the cell, we divided the points into smaller batches. 
We call this version the batched SVM. Each batch has a relatively small number 
of points, 200. We then invoke the SVM routine for each batch, and collect the 
resulting support vectors. We ignore other points that are not support vectors 
because they are unlikely to have a significant influence on the final result of 
SVM. The support vectors of each batch are then combined and used as the input 
to SVM. This process is repeated until the size of the input is small enough to run 
a single SVM. We use the result from this final run as the separating hyperplane 
of the cell. 
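
The batching scheme can be sketched independently of any particular SVM package. In the sketch below, `solve` is a caller-supplied stand-in for an SVM routine that returns the support vectors of its input; it is not a LIBSVM call. The guard against a round that makes no progress, and the one-dimensional toy solver used for the example, are additions of this sketch.

```python
import random

def batched_support_vectors(points, solve, batch_size=200):
    """Reduce `points` to support vectors batch by batch, then solve once more."""
    current = list(points)
    while len(current) > batch_size:
        reduced = []
        for start in range(0, len(current), batch_size):
            reduced.extend(solve(current[start:start + batch_size]))
        if len(reduced) >= len(current):   # no progress: stop to avoid looping
            break
        current = reduced
    return solve(current)                  # final single run on the survivors

def toy_solve(batch):
    """Toy 1-D 'SVM' on (x, label) pairs with negatives left of positives:
    the support vectors are the largest negative and the smallest positive."""
    neg = [p for p in batch if p[1] == -1]
    pos = [p for p in batch if p[1] == +1]
    return ([max(neg)] if neg else []) + ([min(pos)] if pos else [])

# Hypothetical separable data: 500 negatives below 500, 500 positives from 600.
data = [(float(x), -1) for x in range(500)] + [(float(x), +1) for x in range(600, 1100)]
random.Random(0).shuffle(data)
result = batched_support_vectors(data, toy_solve)
```

In this one-dimensional toy case the batched result coincides with the unbatched one; in higher dimensions the two may differ, which is consistent with treating the batched version as an approximation.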

Obviously, the batched version of SVM would be expected to perform more 
poorly than the standard (unbatched) method. But we found that over a number 
of experiments it usually produced a hyperplane that is close to the result of the 
standard algorithm. The CPU time required to find the hyperplane using the 
batched SVM is significantly less than that required by the standard SVM. 

3 Experimental Results 

3.1 Synthetic Data Sets 

We begin by discussing our results on synthetically generated data sets. Because 
the os-tree requires a relatively large number of training points, one advantage 
of synthetic data sets is that it is easy to generate sufficiently large training sets. 
To emulate real data sets, we chose data distributions that are clustered into 
subsets having low intrinsic dimensionality. (Our subsequent experiments with 
real data sets bear out this choice.) The two distributions that we considered 
are outlined below. 

Clustered-rotated-flats: Points are distributed in a set of clusters. The points 
in each cluster are first generated along an axis-aligned flat, and then this flat 
is rotated randomly. The axes parallel to the flat are called fat dimensions, 
and the others are called thin dimensions. The distribution is characterized 
by the following parameters: 

— the number of clusters (fixed at 5 clusters in all of our experiments), 

— the number of fat dimensions, 

— the standard deviation for thin dimensions, σ_thin, 

— the number of rotations applied. 

The center of each cluster is sampled uniformly from the hypercube [−1, 1]^d. 
Then we randomly select the fat dimensions from among axes of the full 
space. Points are distributed randomly and evenly among the clusters. In fat 
dimensions, coordinates are drawn from a uniform distribution in the interval 
[−1, 1]. In thin dimensions, they are sampled from a Gaussian distribution 
with standard deviation σ_thin. We compute a random rotation matrix for 
each cluster. This is done by repeatedly multiplying a rotation matrix 
(initially a d × d identity matrix) with a matrix A. The matrix A is an identity matrix 
except for four elements A_ii = A_jj = cos(θ), A_ij = −A_ji = sin(θ), where i 
and j are randomly chosen axes and θ is randomly chosen from −π/2 to π/2. 
We apply the rotation matrix to all points in the cluster and then translate 
them by a vector from the origin to the center of the cluster. 
Clustered-rotated-ellipsoids: This distribution is similar to the clustered- 
rotated-flats distribution, except that the coordinates are sampled from a 
Gaussian distribution on fat dimensions instead of a uniform distribution. 
It has the additional parameters σ_lo and σ_hi, which are the minimum and 
maximum standard deviations of the Gaussian distribution of the fat 
dimensions. The actual standard deviation is randomly chosen in [σ_lo, σ_hi]. In our 
experiments, we used σ_lo = 0.25 and σ_hi = 0.5. 
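
A generator for the clustered-rotated-flats distribution might look like the following sketch. Several details are assumptions of the sketch rather than statements from the text: fat dimensions are drawn separately for each cluster, points are assigned to clusters round-robin, and all function and parameter names are hypothetical.

```python
import math
import random

def random_rotation(d, num_rotations, rng):
    """Compose `num_rotations` random plane rotations, starting from identity."""
    R = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
    for _ in range(num_rotations):
        i, j = rng.sample(range(d), 2)
        theta = rng.uniform(-math.pi / 2, math.pi / 2)
        c, s = math.cos(theta), math.sin(theta)
        for row in R:   # R <- R * A, with A_ii = A_jj = c, A_ij = -A_ji = s
            ri, rj = row[i], row[j]
            row[i] = c * ri - s * rj
            row[j] = s * ri + c * rj
    return R

def clustered_rotated_flats(n, d, fat_dims, sigma_thin, num_rotations,
                            num_clusters=5, seed=0):
    rng = random.Random(seed)
    clusters = []
    for _ in range(num_clusters):
        center = [rng.uniform(-1, 1) for _ in range(d)]
        fat = set(rng.sample(range(d), fat_dims))
        clusters.append((center, fat, random_rotation(d, num_rotations, rng)))
    points = []
    for k in range(n):
        center, fat, R = clusters[k % num_clusters]   # even assignment
        x = [rng.uniform(-1, 1) if a in fat else rng.gauss(0.0, sigma_thin)
             for a in range(d)]
        rotated = [sum(R[r][a] * x[a] for a in range(d)) for r in range(d)]
        points.append(tuple(rotated[r] + center[r] for r in range(d)))
    return points

sample = clustered_rotated_flats(n=100, d=4, fat_dims=2, sigma_thin=0.05,
                                 num_rotations=4)
```

The clustered-rotated-ellipsoids variant would replace the uniform draw on fat dimensions with a Gaussian whose standard deviation is chosen between the two additional parameters described above.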

We compared the query performance of the os-tree against that of the 
kd-tree. The query performance was measured by the total number of nodes visited, 
the number of leaf nodes visited, and the CPU time used. We show CPU time in 
most of our graphs. We compiled both programs using the g++ 2.95.2 compiler 
with the same options, and we conducted all experiments on the same class of 
machines (UltraSparc-5 running Solaris 2.6 for the synthetic data sets and a PC 
Celeron 450 running Linux for the real data sets). In the first set of experiments 
we also compared the total number of nodes visited. 

There is some difficulty in a direct comparison between the os-tree and the 
kd-tree because the search models are different: the probably-correct model in the 
os-tree and the approximately correct model in the kd-tree. To reconcile this difference, 
in each experiment, we adjusted the ε value (approximation parameter) of the 
kd-tree so that the resulting failure probability of the kd-tree is similar to that 
of the os-tree. Once the ε value is found, we ran the experiments with the 
same query set because changing it may alter the failure probability. This gives 
the kd-tree a slight advantage, because the query set is the same as the training 
set. 
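
One way to carry out such a matching is a bisection on ε against a measured failure rate. The text does not say how ε was adjusted, so the procedure below is purely a hypothetical sketch: it assumes the failure rate does not decrease as ε grows, and `run_with_eps` stands in for running the kd-tree queries at a given ε.

```python
def failure_rate(answers, truths):
    """Fraction of queries whose reported neighbor is not the true one."""
    wrong = sum(1 for a, t in zip(answers, truths) if a != t)
    return wrong / len(truths)

def match_epsilon(run_with_eps, truths, target, lo=0.0, hi=2.0, steps=30):
    """Bisection for an eps whose failure rate is close to `target`."""
    for _ in range(steps):
        mid = (lo + hi) / 2
        if failure_rate(run_with_eps(mid), truths) < target:
            lo = mid          # too few failures: eps can grow
        else:
            hi = mid          # enough failures: shrink eps
    return (lo + hi) / 2

# Hypothetical stand-in search: query t is answered wrongly once eps exceeds
# its own threshold, so the failure rate is a step function of eps.
thresholds = [0.1 * k for k in range(1, 11)]
truths = list(range(10))

def toy_run(eps):
    return [t if eps <= thresholds[t] else -1 for t in truths]

eps = match_epsilon(toy_run, truths, target=0.5)
```

In the toy setting the failure rate first reaches 0.5 just above ε = 0.5, and the bisection converges to that value.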

The results of the first set of experiments are shown in Fig. 2. The point 
set, training set and query set (for os-trees) are all sampled from the 
clustered-rotated-ellipsoids distribution. The number of clusters is fixed at 5 and σ_thin is 
fixed at 0.05 and 0.25. We show the results for dimensions d = 10, 20, 30. The 
number of fat dimensions is fixed, relative to d, at . The number of rotations 
applied is equal to the dimension of the space. The size of the point set varies 
from 2K (K = 1024) to 32K points. For all experiments, the number of training 
points (in os-trees) is fixed at 200 times the size of the point set and the results 
are averaged over 5 different trees, each with 10K different queries. 

The top graphs of Fig. 2 show comparisons of the average CPU time each 
query uses. The bottom graphs compare the total number of nodes visited. We 
can see that these two running times are remarkably similar. The kd-tree’s query 
Fig. 2. Query time and number of nodes visited comparison for the kd-tree and os-tree. 
All distributions are clustered-rotated-ellipsoids 



time is somewhat better than that of the os-tree when the point set is large and σ_thin 
is high. The search in the os-tree is slightly faster in low dimension (d = 10) and for 
smaller data set sizes. Overall, the differences in the query time of both trees are 
quite small. 

The second set of experiments is quite similar to the first one except that 
we change the distribution from clustered rotated ellipsoids to clustered rotated 
flats. The query time comparison of both trees is presented in Fig. 3. The results 
of Fig. 3 are very similar to those of Fig. 2. The query times of both trees are 
comparable. Note that the search is somewhat faster for clustered rotated flat 
distributions for both trees. 

In the next set of experiments we varied the density of the clusters. As 
σ_thin increases from 0.01 to 0.5, the clusters become less dense and the distribution 
more uniform. The point set size is fixed at 4K and 16K with d rotations applied. 
Fig. 4 shows the results. Again, the performances of both trees are quite similar. 

3.2 Enhanced Predictability 

We have shown that the os-tree and kd-tree are similar with respect to running 
time (subject to the normalizations mentioned earlier). The principal difference 
between them is in the degree to which it is possible to predict performance. In 
Fig. 3. Query time comparison for the kd-tree and os-tree (panels: stdev-thin = 0.05 
and stdev-thin = 0.25). All distributions are clustered-rotated-flats 

Fig. 4. Effect of σ_thin (stdev-thin, standard deviation on thin dimensions of the flats; 
panels: point set size = 4K and point set size = 16K). All distributions are clustered 
rotated flats 



all the experiments above, we set the size of the training set to be 200 times the size 
of the point set. Based on the ratio between the size of the training set and the size of 
the data set, the predicted failure probability of the os-tree is around 0.5%. We also 
computed the failure rate of similar experiment runs of the kd-tree with ε = 0.52 
(the average of the matching values of ε) and ε = 0.8. Fig. 5 shows the failure rate 
sorted in increasing order of some 258 experimental runs. We can see that the 
variation of the failure rate of the kd-tree is much higher (which is not unexpected, 
since the kd-tree is not designed for probably-correct nearest neighbor search). 
This justifies our claim that the os-tree provides efficiency comparable to that of the 
kd-tree, but with significantly enhanced predictability. 

3.3 Real Data Sets 

To demonstrate that the synthetic experiments were reflective of actual perfor- 
mance in practice, we also ran the algorithm on a number of real data sets. These 
data sets are: 
Fig. 5. Failure rate, sorted in increasing order from 258 runs. All distributions are 
clustered rotated ellipsoids 




Fig. 6. Average error. Showing the average (connecting line), standard deviation (box), 
and maximum (bar line) of both kd-tree (top graph) and os-tree (bottom graph). All 
distributions are clustered rotated flats 



MODIS satellite images: MODIS (moderate resolution imaging 
spectroradiometer) is a satellite-borne sensor that acquires data in 36 spectral 
bands. The data from the sensor are archived in files. The data in each file 
correspond to a specific region (of size roughly 2000 by 1400 kilometers) 
of the earth at a particular time. The data are then processed into several 
levels for various purposes. (For more information about MODIS, see 
http://modis.gsfs.nasa.gov.) We use minimally processed level 1a and 1b 
data sets with aggregated resolution of 1 kilometer (some bands have finer 
resolution). The data file also contains information about the percentage of 
valid observations in each spectral band. Bands with 90% or more valid 
observations are used in the experiments. Each data file contains nearly 3 
million pixels; each pixel is a point in a 36-dimensional space. Each 
coordinate value is stored in 2-byte short integer format. We assign these pixels 
randomly to the data set, query set and training set. There are four MODIS 
data sets used in the experiment. The first one is a level 1a data set covering 
the California area. The second data set is a level 1b image taken above 
Northwest Africa. The third and fourth data sets are also level 1b images 
above the Himalaya region. 

Video sequences: These short video sequences are captured with fixed-position
cameras and stored in uncompressed format. The cameras are at different
positions, capturing the same moving subject. Each frame in the video
sequences is a 320 x 240 RGB color image with 8 bits per channel. Data are
points in $3t^2$-dimensional space. The value of each coordinate of a point
is the intensity of one of the R, G, B channels of a pixel in a t x t block
of pixels in the image. We use 2 x 2 and 3 x 3 blocks to convert an image
into a 12- or 27-dimensional point set, respectively. Again, we assign the
points randomly to the data, query, and training sets.

World Ocean Atlas: This data set contains information about some basic
attributes of the oceans of the world. Examples of attributes are temperature,
salinity, etc. The complete data set contains annual, seasonal, and monthly
data as well as some statistics at 1-degree and 5-degree resolution. At each
location, there are data for various depths of the ocean. We use only the
seasonal 1-degree analyzed mean data. There are 9 attributes in the data set,
but we use only 8 of them since the size of the last attribute was much
smaller than the others. The data type of all attributes is floating point.
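
As a rough illustration of the video-sequence conversion described above, the following Python sketch turns one RGB frame into points of dimension 3t^2. It assumes non-overlapping t x t blocks, row-major ordering of the pixels inside a block, and that incomplete border blocks are dropped; the text states none of these choices, and the name `frame_to_points` is hypothetical.

```python
def frame_to_points(frame, t):
    """Convert an RGB frame into 3*t*t-dimensional points.

    frame is a list of rows, each row a list of (r, g, b) tuples.
    Assumption (not from the paper): blocks do not overlap and
    incomplete blocks at the border are dropped.
    """
    height = len(frame)
    width = len(frame[0]) if height else 0
    points = []
    for top in range(0, height - t + 1, t):
        for left in range(0, width - t + 1, t):
            point = []
            for dy in range(t):
                for dx in range(t):
                    # Append the R, G, B intensities of one pixel.
                    point.extend(frame[top + dy][left + dx])
            points.append(tuple(point))
    return points


# A tiny 4 x 4 synthetic frame; a real frame would be 320 x 240.
frame = [[(x, y, x + y) for x in range(4)] for y in range(4)]
points = frame_to_points(frame, 2)
```

With t = 2 every point has 12 coordinates, and with t = 3 it has 27, matching the dimensions quoted for the video data sets.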

In order to support the assertion that the os-tree and kd-tree are of
comparable efficiency, consider Fig. 7 (again selecting the $\epsilon$ for the
kd-tree so that the failure probability matches that of the os-tree). Where
two values are reported for a data set, the first is for the os-tree and the
second is for the kd-tree. The query time values are in milliseconds. In all
data sets we tested, the average query time using the os-tree is somewhat
better than that of the kd-tree.



Data set      tree     dim  data   query  training  failures (%)   query time         average error        matching
                            size   size   size                                                             epsilon
1 (level 1A)  os / kd  31   10338  9360   2728922   0.363 / 0.363  0.18696 / 0.20085  0.000143 / 0.000202  0.62
2 (level 1B)  os / kd  27   13779  13599  2721242   0.455 / 0.441  0.55077 / 0.59048  0.000173 / 0.000258  0.55
3 (level 1B)  os / kd  26   13588  13638  2721394   0.615 / 0.615  0.22363 / 0.40695  0.000231 / 0.000315  0.55
4 (level 1B)  os / kd  27   13660  13710  2734790   0.510 / 0.503  0.18453 / 0.32020  0.000207 / 0.000317  0.67

Fig. 7. MODIS dataset result

Camera       tree     dim  data   query  training  failures (%)   query time     average error        matching
                           size   size   size                                                         epsilon
1 (2 x 2)    os / kd  12   11339  11277  2247904   0.381 / 0.399  0.17 / 0.246   0.000192 / 0.000361  0.39
2 (2 x 2)    os / kd  12   11339  11277  2247904   0.407 / 0.416  0.187 / 0.258  0.000228 / 0.000215  0.35
3 (2 x 2)    os / kd  12   28136  28049  5620115   0.335 / 0.327  0.329 / 0.462  0.000166 / 0.000179  0.36
1 (3 x 3)    os / kd  27   11263  11193  2231414   0.402 / 0.410  0.352 / 0.616  0.000097 / 0.000219  0.69
2 (3 x 3)    os / kd  27   11263  11193  2231414   0.571 / 0.598  0.427 / 0.697  0.000172 / 0.000387  0.63
3 (3 x 3)    os / kd  27   27898  27876  5578901   0.304 / 0.308  0.77 / 1.293   0.000134 / 0.000130  0.62

Fig. 8. Video sequence dataset result



tree     dim  data   query  training  failures (%)   query time         average error        matching
              size   size   size                                                             epsilon
os / kd  8    11150  11053  2205937   0.506 / 0.479  0.03528 / 0.06514  0.000242 / 0.000171  0.22

Fig. 9. World Ocean Atlas data set result. Where two values are listed, the
first is the result of the os-tree and the second is the result of the kd-tree



The os-tree also outperformed the kd-tree on the video sequence data sets.
Fig. 8 shows the performance of both trees.

In the last data set, the World Ocean Atlas, the os-tree answers queries
almost twice as fast as the kd-tree on average (Fig. 9).

Abstract. Experimental evidence suggests that spectral techniques are
valuable for a wide range of applications. A partial list of such
applications includes (i) semantic analysis of documents used to cluster
documents into areas of interest, (ii) collaborative filtering, i.e. the
reconstruction of missing data items, and (iii) determining the relative
importance of documents based on citation/link structure. Intuitive arguments
can explain some of the phenomena that have been observed, but little
theoretical study has been done. In this talk, we present a model for framing
data mining tasks and a unified approach to solving the resulting data
mining problems using spectral analysis. In particular we describe the
solution to an open problem of Papadimitriou, Raghavan, Tamaki and
Vempala in the context of modeling latent semantic indexing. We also
give theoretical justification for the use of spectral algorithms for
collaborative filtering, and show how a reasonable model of web links
justifies the robustness of Kleinberg’s web authority/hub algorithm. A major
focus of the talk will be a description of the tight feedback loop between
theory and empirical work and how it has led on this project to both
new theory and new empirical questions of interest.

This is joint work with Yossi Azar, Amos Fiat, Frank McSherry and 
Jared Saia. 

Abstract. The suffix array is a widely used full-text index that allows fast
searches on the text. It is constructed by sorting all suffixes of the text in
lexicographic order and storing pointers to the suffixes in this order.
Binary search is used for fast searches on the suffix array. The compact
suffix array is a compressed form of the suffix array that still allows binary
searches, but the search times also depend on the compression. In
this paper, we answer some open questions concerning the compact suffix
array, study practical issues such as the trade-off between compression
and search times, and show how to reduce the space requirement of
the construction. Experimental results are provided in comparison with
other search methods. The results show that the size of a compact suffix
array is usually less than twice the size of the text, while the search
times are still comparable to those of suffix arrays.



1 Introduction 

The classical problem in string-matching is to determine the occurrences of a 
short pattern P in a large text T. Usually the same text is queried several times 
with different patterns, and therefore it is worthwhile to preprocess the text in 
order to speed up the searches. Preprocessing builds an index structure for the 
text. 

To allow fast searches for patterns of any size, the index must allow access to
all suffixes of the text. Indexes of this kind are called full-text indexes.
The optimal query time is $O(m + occ)$, since every character of the pattern
$P$ of length $m = |P|$ must be checked and the $occ$ occurrences must be
reported; it can be achieved by using the suffix tree as the index. In a
suffix tree every suffix of the text is represented by a path from the root
to a leaf.

The space requirement of a suffix tree is relatively high. It can be 12n bytes
in the worst case even with a careful implementation, where $n = |T|$ is the
size of the text. In addition, there is always an alphabet-dependent factor in
the search times in any practical implementation.

* A work supported by the Academy of Finland under grant 22584. 

^ We assume an environment, where an integer takes 32 bits, and a byte takes 8 
bits. The space requirements of the index structures do not include the n bytes for 
representing the text. 
The suffix array is a reduced form of the suffix tree. It represents
only the leaves of the suffix tree in the lexicographic order of the corresponding
suffixes. A suffix array takes 4n bytes, and provides searching in
$O(m \log n + occ)$ time with binary search. This can be further improved to
$O(m + \log n + occ)$ by storing information about longest common prefixes
(LCPs).
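
The plain binary search over a suffix array can be sketched as follows. This is a minimal illustration, not the implementation used in the paper: the suffix array is built by naive sorting, positions are 1-based as in the text, and the helper names `build_suffix_array` and `find_occurrences` are hypothetical.

```python
def build_suffix_array(text):
    """Return the 1-based starting positions of all suffixes in lexicographic order.

    Plain sorting is used for brevity; it is not one of the efficient
    construction algorithms the paper refers to.
    """
    return sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])


def find_occurrences(text, sa, pattern):
    """Return the sorted starting positions of pattern, using two binary searches."""
    m = len(pattern)

    def prefix(k):
        # First m characters of the suffix stored at entry k (0-based list index).
        start = sa[k] - 1
        return text[start:start + m]

    # Leftmost entry whose m-character prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    # Leftmost entry whose m-character prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[first:lo])


text = "gtagtaaac"
sa = build_suffix_array(text)
hits = find_occurrences(text, sa, "ta")
```

Each comparison inspects up to m characters, which gives the m log n term of the search time; the LCP information mentioned above is what removes the repeated character comparisons.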

There is often repetition in a suffix array such that some areas of the suffix
array can be represented as links to other areas. This observation results in
a new full-text index called the compact suffix array. By controlling which
areas are replaced by links, fast searches are still possible. This gives us a
trade-off problem, where the search times depend on the compression.

In this paper we consider two open problems on the compact suffix array,
namely, how to construct a minimal compact suffix array, and how to obtain
the LCP information for the compact suffix array. We also analyze the space
requirement of the construction, and show that a minimal compact suffix array
can be constructed in $O(n \log n)$ time using at most 9n bytes (or
$13\frac{1}{8}n$ bytes if the LCP values are calculated). Then we study
experimentally the trade-off problem between compression and search times,
and show that a compact suffix array of size less than 2n can still be
efficient in searches.

2 Compact Suffix Array

Let $\Sigma$ be an ordered alphabet and $T = t_1 t_2 \cdots t_n \in \Sigma^*$
a text of length $n = |T|$. A suffix of text $T$ is a substring
$T_{i \ldots n} = t_i \cdots t_n$. A suffix can be identified by its starting
position $i \in [1 \ldots n]$. A prefix of text $T$ is a substring
$T_{1 \ldots i}$. The length of the longest common prefix (LCP) of suffixes
$i$ and $i'$ is $\min\{l \mid T_{i+l} \neq T_{i'+l}\}$.

Definition 1 The suffix array of text $T$ of length $n = |T|$ is an array
$A(1 \ldots n)$ that contains all starting positions of the suffixes of the
text $T$ such that
$T_{A(1) \ldots n} < T_{A(2) \ldots n} < \cdots < T_{A(n) \ldots n}$, i.e.
the array $A$ gives the lexicographic order of all suffixes of the text $T$.

A suffix array with LCP information is one in which the LCPs between the
suffixes that are consecutive in the binary search order are stored as well.
Suffixes are consecutive in the binary search order if a pattern is compared
to them consecutively in binary search. Each suffix can have only two
preceding suffixes in binary search order, and only the maximum of the two
values needs to be stored, together with an indication of which value is
stored (left or right). The other value can be calculated during the binary
search. An example of the suffix array with LCP information is shown in
Fig. 1, where column lcp stores the LCP values (+ = left, - = right). With
the LCP information, a pattern of length $m$ can be searched in
$O(m + \log n + occ)$ time, because each character of the pattern must be
compared only once during the search.

The idea of compacting the suffix array is the following: Let $\Delta > 0$.
Find two areas $a \ldots a+\Delta$ and $b \ldots b+\Delta$ of $A$ that are
repetitive in the sense that the suffixes represented by $a \ldots a+\Delta$
are obtained, in the same order, from the suffixes represented by
$b \ldots b+\Delta$ by deleting the first symbol. In other words,
$A(a+i) = A(b+i) + 1$ for $0 \le i \le \Delta$. Then replace the area
$a \ldots a+\Delta$ of $A$ by a link, stored in $A(a)$, to the area
$b \ldots b+\Delta$. This operation is called a compacting operation. The
areas may be compacted recursively, meaning that area $b \ldots b+\Delta$ (or
some parts of it) may also be replaced by a link. Therefore, we define an
uncompacting operation to retrieve area $a \ldots a+\Delta$ from an area
$b \ldots b+\Delta$ that is recursively compacted using a compacting
operation. A path that is traversed when a suffix is uncompacted is called a
link path.

Due to the recursive definition, we need three values to represent a link:

• A pointer $x$ to the entry that contains the start of the linked area.

• A value $\Delta x$ such that entry $x + \Delta x$ denotes the actual
starting point after entry $x$ is uncompacted.

• The length of the linked area $\Delta$.

Definition 2 The compact suffix array (CSA) of text $T$ of length $n = |T|$
is an array $CA(1 \ldots n')$ of length $n'$, $n' \le n$, such that for each
entry $1 \le i \le n'$, $CA(i)$ is $(B, x, \Delta x, \Delta)$, where the value
of $B \in \{$"suffix", "link"$\}$ determines whether $x$ denotes a suffix of
the text, or whether $x$, $\Delta x$, and $\Delta$ denote a link to an area
obtained by a compacting operation. The suffixes that are explicitly present
in the structure are in lexicographic order.

The optimal CSA for $T$ is such that its size $n'$ is the smallest possible.
We use $CA(i).B$, $CA(i).x$, $CA(i).\Delta x$, and $CA(i).\Delta$ to denote
the value of $B$, $x$, $\Delta x$, and $\Delta$ at entry $i$ of the array
$CA$. An example of the CSA is shown in Fig. 1.

Fig. 1. Suffix array of text “gtagtaaac” and the construction of (binary searchable)
compact suffix array.



3 Constructing an Optimal CSA from a Suffix Array

A compact suffix array can be constructed as follows. First, the areas that
can be replaced by links are marked in the original array. This is done by
advancing from the first entry of the array to the last, and finding for each
suffix $A(i)$ the entry $x$ that contains suffix $A(i) - 1$. The entry $x$ of
suffix $A(i) - 1$ can be found in constant time by constructing the reverse
$A^R$ of array $A$ in advance; the array $A^R$ is easily computed in linear
time by setting $A^R(A(i)) = i$ for all $i \in 1 \ldots n$. After the entry
$x$ of suffix $A(i) - 1$ is found, the repetitiveness is checked between the
areas starting at $i$ and $x$, and in the positive case, the area starting
at $i$ is marked to be replaced by a link to the area starting at $x$. The
second step is to construct the compacted array by replacing marked areas by
links, and changing links to refer to the entries of the compact array. Both
steps work in linear time. An example of the construction is shown in Fig. 1.
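
The marking step can be sketched in Python as below. This is only an illustration under simplifying assumptions: it reports every maximal run of entries that satisfies the repetitiveness condition, and it leaves out the choice among candidates, the rewriting of links to compact-array entries, and the binary-searchable restriction introduced later. The names `inverse_array` and `linkable_areas` are hypothetical.

```python
def inverse_array(sa):
    """Return inv with inv[s] = entry (1-based) holding suffix s; inv[0] is unused."""
    inv = [0] * (len(sa) + 1)
    for entry, suffix in enumerate(sa, start=1):
        inv[suffix] = entry
    return inv


def linkable_areas(sa):
    """List maximal areas (a, b, count) with A(a+k) = A(b+k) + 1 for 0 <= k < count.

    Entry numbers are 1-based and count is the number of entries in the
    area. Only areas of at least two entries are reported.
    """
    inv = inverse_array(sa)
    # target[i] is the entry holding suffix A(i) - 1, or None for suffix 1.
    target = [None] + [inv[s - 1] if s > 1 else None for s in sa]
    areas = []
    n = len(sa)
    i = 1
    while i <= n:
        j = i
        # Extend the run while consecutive entries map to consecutive targets.
        while (j < n and target[j] is not None
               and target[j + 1] is not None
               and target[j + 1] == target[j] + 1):
            j += 1
        if j > i:
            areas.append((i, target[i], j - i + 1))
        i = j + 1
    return areas


sa = [6, 7, 8, 3, 9, 4, 1, 5, 2]  # suffix array of "gtagtaaac"
areas = linkable_areas(sa)
```

For the text "gtagtaaac" the first reported area covers the entries of suffixes 7 and 8 and points to the entries of suffixes 6 and 7, so the source and target areas overlap.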

The above method was described in earlier work, but a heuristic was used to
avoid cyclic links. Figure 1 shows a simple example of a cyclic link; suffixes
7 and 8 are replaced by links to suffixes 6 and 7, i.e. an area is linked to
itself. In general, cyclic link paths can be generated. However, an area
cannot be replaced totally by itself via link paths; there is always one
suffix where the uncompacting can start. In the next section, we will give an
algorithm that can uncompact this kind of cyclic link. Therefore, we can
remove the heuristic and state that the above algorithm constructs the
optimal CSA.

4 Binary Search over CSA 

4.1 Uncompacting Cyclic Links 

Searches in the CSA can be done as in the suffix array, using binary
search. To make the searches efficient, we need the following modification to
the definition of the CSA.

Definition 3 The binary searchable compact suffix array (BSCSA) of text $T$
is an array $CA(1 \ldots n')$ which is as in the definition of the CSA with
the following limitations: $CA(1).B = CA(n').B =$ "suffix" and, for each
$i \in [2 \ldots n'-1]$, if $CA(i).B =$ "link" then $CA(i+1).B =$ "suffix".

The algorithm in Sect. 3 that constructs the optimal CSA can be modified
to construct the optimal BSCSA by just skipping one suffix each time a link is
created. From now on, we assume that a CSA is a BSCSA (as in Fig. 1).

As only every other entry of the CSA can be a link, we can make binary
searches over the entries that contain suffixes; if an entry containing a link
is chosen to be the next left or right limit, we can choose its preceding
entry instead. With two binary searches we can find entries $l$ and $r$ such
that the pattern may occur in entries $l \ldots r$. Then we can uncompact all
the links in that area. If entry $l$ contains a link, the beginning of the
pattern occurrences must be searched after uncompacting that link (with
another binary search). Similarly the end of the pattern occurrences must be
searched from the suffixes represented by entry $r$. Now, it remains to
define how the links can be uncompacted. In earlier work, a recursive
procedure for this purpose was given under the assumption that no cyclic
links exist. We now give an algorithm that can also uncompact cyclic links.
The algorithm is shown in Fig. 2.

UnCompactLink(in CA, in x, in d, in out UCA, in out j, in Δr, in Δp):

(1) y ← CA(x).x; Δx ← CA(x).Δx + Δp; Δ ← min{CA(x).Δ − Δp, Δr}
(2) j' ← j; i ← 0
(3) while Δx > 0 do begin {Skip links and suffixes that are not required.}
(4)    if CA(y + i).B = "suffix" then begin
(5)       i ← i + 1; Δx ← Δx − 1 end
(6)    else begin
(7)       if CA(y + i).Δ < Δx then begin
(8)          Δx ← Δx − CA(y + i).Δ; i ← i + 1 end
(9)       else break end end
(10) while j − j' < Δ do begin {Uncompact until required suffixes are copied.}
(11)    if CA(y + i).B = "link" then begin
(12)       UnCompactLink(CA, y + i, d + 1, UCA, j, Δ − (j − j'), Δx)
(13)       Δx ← 0 end
(14)    else begin
(15)       UCA(j).x ← CA(y + i).x + d + 1; j ← j + 1 end
(16)    i ← i + 1 end

Fig. 2. An algorithm to uncompact a link.



The difference between this version of UnCompactLink and the earlier version
is that the information about the callers is carried through the recursion, in
order to uncompact only those suffixes that are needed by the first caller.
The ability to uncompact cyclic links follows, because no extra suffixes are
uncompacted, and all the links that do not contain the required suffixes are
skipped.

Theorem 4 Function UnCompactLink works correctly, i.e. it returns an array
$UCA(1 \ldots \Delta)$ that contains the suffixes that are represented by a
link $(x, \Delta x, \Delta)$ in a compact suffix array $CA$.

Proof. We use induction for the proof. Consider the recursive calls to
UnCompactLink as a tree, where the first caller is the root and each call to
UnCompactLink forms a child node in the tree, whose parent node is the caller.
The tree is traversed in depth-first order. If the root passes parameters
UnCompactLink($CA, x, 0, UCA, 1, CA(x).\Delta, 0$) to its only child, then the
required information is available at the beginning. Assume that at some point
in the depth-first traversal, the correct information is passed to a child.

Now, $\Delta x$ gets the value of how many suffixes must be skipped in this
sub-tree (the skip value passed from the parent plus the cumulated skip value
in the path from the root to the parent node), and $\Delta$ gets the value of
how many suffixes must be uncompacted in this sub-tree (the minimum of two
values: (1) the length of the linked area passed from the parent minus the
cumulated skip value in the path from the root to the parent node, and (2)
the number of suffixes that remain to be uncompacted for the root). Lines
(3)-(9) skip all the suffixes and links that cannot represent the suffixes
that are required for the root.
If the skipping stops at some link ($\Delta x > 0$ after line (9)), then that
link is uncompacted at line (12) with parameters
$\Delta r \leftarrow \Delta - (j - j') = \Delta$ and
$\Delta p \leftarrow \Delta x$, which are correct, because $\Delta r$ gets the
value of how many suffixes remain to be uncompacted in this sub-tree, and
$\Delta p$ gets the value of how many suffixes should be skipped.

If the skipping stops with value $\Delta x = 0$, or $\Delta x$ is set to zero
at line (13), then the suffixes are uncompacted correctly at line (15) or
links are uncompacted at line (12), until the condition at line (10) becomes
unsatisfied (i.e. all required suffixes are uncompacted). The correctness of
the call at line (12) follows, because the parameters get the right values;
in particular $\Delta r \leftarrow \Delta - (j - j')$ (the number of required
suffixes minus the already uncompacted suffixes) and
$\Delta p \leftarrow \Delta x = 0$ (no skipping). Finally, the value of $j$ is
updated during the uncompacting, and the parent gets the information of how
many suffixes were uncompacted by its child. □



4.2 Constructing LCP Values to Speed Up Searches 

The LCP information can be used to speed up the searches in the CSA in the
same way as in the suffix array. The LCP values are only needed for the
suffixes that are present in the CSA, and so these values can be stored in
the same space that is allocated for the fields $\Delta x$ and $\Delta$. The
LCP values for the CSA can be obtained in linear time as follows:

(1) Use a suffix array construction algorithm that also provides the LCP
values between the adjacent suffixes in the suffix array.

(2) Copy the LCP values of adjacent suffixes from the suffix array to the
suffixes that are present in the CSA, modifying the LCP values of the
suffixes that follow a linked area so that each such LCP value is the minimum
of the LCP values of the suffixes represented by the preceding link and of
the suffix itself.

(3) From the LCP values of adjacent suffixes, obtain the LCP values of the
suffixes that are consecutive in the binary search order.

There is one drawback in the above method: the fastest suffix array
construction algorithms do not provide the LCP values (see Larsson and
Sadakane). Kasai et al. gave an algorithm to construct the LCP values from
the suffix array in linear time, but their algorithm uses more space than an
efficient suffix array construction. We give here a more space-efficient
version of the algorithm (Fig. 3). The space efficiency is achieved by using
an array for two purposes and an algorithm to reverse a suffix array in
linear time using only n extra bits.

The algorithm Reverse in Fig. 3 reverses a suffix array as follows: the value 
of i is written in the entry SA{i) and the old content denotes the next entry 
where the value SA{i) is written, and so on, until an entry is reached which has 
already been updated. This is repeated for each i. The algorithm works also in 
the other direction (from reverse array to original array). 
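
The cycle-following idea behind Reverse can be sketched in Python as follows. This is an illustrative sketch rather than the paper's code: the list carries a dummy element at index 0 so that indices match the 1-based notation, and the `state` flags are a list of booleans standing in for the n extra bits.

```python
def reverse_in_place(sa):
    """Invert a permutation of 1..n in place, using one flag per entry.

    sa is a list with a dummy element at index 0; after the call,
    sa[s] holds the entry number that previously contained s.
    """
    n = len(sa) - 1
    state = [False] * (n + 1)  # stands in for the n extra bits
    for i in range(1, n + 1):
        j, x = sa[i], i
        while not state[j]:
            nxt = sa[j]        # old content: the next entry to update
            sa[j] = x          # entry j receives the index that pointed to it
            state[j] = True
            x, j = j, nxt


sa = [0, 6, 7, 8, 3, 9, 4, 1, 5, 2]  # dummy 0, then the suffix array of "gtagtaaac"
reverse_in_place(sa)
inverse = sa[1:]
```

Every entry is written exactly once, so the total work is linear, and applying the function a second time restores the original array.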

The algorithm SAtoLCP in Fig. 3 calculates the LCP values. First, it copies 
the reverse of the suffix array into LCP array, and calculates the LCP values 

Reverse(in out SA, in n):

(1) for each i ∈ [1 ... n] do begin
(2)    j ← SA(i); x ← i
(3)    while state(j) = false do begin
(4)       j' ← j; j ← SA(j'); SA(j') ← x; state(j') ← true; x ← j' end end

SAtoLCP(in T, in SA, out LCP):

(1) n ← |T|; Copy(SA, LCP); Reverse(LCP, n); last ← 0; new ← 0
(2) for each i ∈ [1 ... n] do begin
(3)    i' ← LCP(i)
(4)    if i' = 1 then LCP(i) ← 0
(5)    else begin
(6)       last ← max{0, last − 1}
(7)       new ← CalcLCP(T_{SA(i'−1)+last ... n}, T_{SA(i')+last ... n})
(8)       LCP(i) ← last + new − 1 end
(9)    last ← LCP(i) end
(10) Reverse(SA, n); Reverse(SA, LCP, n)

Fig. 3. Algorithms for reversing a suffix array and calculating the LCP values between
adjacent suffixes in the suffix array. The call to function CalcLCP(A, B) returns the
length of the LCP between strings A and B. The call to Reverse(SA, n) with an
additional parameter LCP reverses the LCP array in the same order as SA is reversed.



from suffix 1 to suffix n. After line (9) the LCP values are stored in reverse order, 
and so the array LCP is reversed at line (10). 

The linear time requirement of algorithm SAtoLCP follows from the fact that
the comparisons done with the preceding suffix are used in the next comparison.
Let $lcp(i)$ denote the length of the LCP between suffixes $i = SA(x)$ and
$SA(x-1)$ in the suffix array $SA$ of length $n$, when $1 < x \le n$, and let
$lcp(i) = 0$ when $i = SA(1)$. Now, the sum of the comparisons is

$$lcp(1) + 1 + lcp(2) - \max\{0, lcp(1) - 1\} + 1 + \cdots + lcp(n) - \max\{0, lcp(n-1) - 1\} + 1 \le 2n,$$

where the values $+1$ are due to the comparison with the first mismatching
character.

Theorem 5 Given a text $T$ of length $n$ and its suffix array, there is an
algorithm (Fig. 3) to construct the LCP information between the adjacent
suffixes in the suffix array in $O(n)$ time using $n + O(1)$ bits of
additional work space. □
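
The amortization argument can be seen in a direct Python rendering of the linear-time LCP computation. The sketch below follows the general scheme of Kasai et al. with a separate rank array, so it does not reproduce the space saving of Fig. 3; the function name and the 0-based list layout are choices made here for illustration.

```python
def lcp_from_suffix_array(text, sa):
    """Return lcp where lcp[k] is the LCP length of the suffixes at entries k-1 and k.

    sa holds 1-based suffix positions in a 0-based Python list;
    lcp[0] is 0 because the first entry has no predecessor.
    """
    n = len(text)
    rank = [0] * (n + 1)
    for k, s in enumerate(sa):
        rank[s] = k
    lcp = [0] * n
    last = 0
    for s in range(1, n + 1):          # visit suffixes in text order
        k = rank[s]
        if k == 0:
            last = 0
            continue
        prev = sa[k - 1]               # lexicographic predecessor of suffix s
        last = max(0, last - 1)        # at most one matched character is lost
        while (s + last <= n and prev + last <= n
               and text[s + last - 1] == text[prev + last - 1]):
            last += 1
        lcp[k] = last
    return lcp


text = "gtagtaaac"
sa = [6, 7, 8, 3, 9, 4, 1, 5, 2]
lcp = lcp_from_suffix_array(text, sa)
```

Because `last` drops by at most one per suffix and never exceeds n, the inner loop performs O(n) character comparisons in total.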

4.3 Limiting Compression to Speed Up Searches 

In the worst case, the searches in the CSA take linear time, because a link may
represent $O(n)$ suffixes or the length of a link path to a suffix may be
$O(n)$. To provide a guarantee for the search times, we must limit both the
lengths of the linked areas and the lengths of the link paths. We introduce
constants $C$ and $D$ for these purposes. In earlier work, it was proven that
a link can be uncompacted in $O(C^2 D')$ time, where
$D' = \max(1, D - \log C)$. Although we have an improved version of function
UnCompactLink (Fig. 2), this bound is still valid.
Now, we have the following search times for the $(C, D)$ limited CSA: To
report $occ > 0$ occurrences of a pattern of length $m$ in a text of length
$n$ takes $O(m \log C + \log n + C^2 D' + occ \cdot C D')$ time in the worst
case. To get the time needed for existence queries (to determine whether
$occ > 0$), set $occ = 0$.

To ease the comparison to other index structures, we can choose
$D = \lceil \log \lceil \log n \rceil \rceil$ and $C = 2^D - 1$. Now, it takes
$O(m \log\log n + \log^2 n + occ \log n)$ time to report $occ > 0$
occurrences (set $occ = 0$ for existence queries).
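
The substitution behind this bound can be spelled out as follows. The derivation is a sketch added here for illustration; it takes logarithms to base 2 and assumes $n \ge 4$ so that $D \ge 1$.

```latex
\begin{align*}
D = \lceil \log \lceil \log n \rceil \rceil
  &\;\Rightarrow\; \lceil \log n \rceil \le 2^D < 2 \lceil \log n \rceil
  \;\Rightarrow\; C = 2^D - 1 = \Theta(\log n), \\
C = 2^D - 1 \ge 2^{D-1}
  &\;\Rightarrow\; \log C \ge D - 1
  \;\Rightarrow\; D' = \max(1, D - \log C) = 1, \\
m \log C &= O(m \log\log n), \\
C^2 D' &= O(\log^2 n), \\
occ \cdot C D' &= O(occ \log n), \\
O(m \log C + \log n + C^2 D' + occ \cdot C D')
  &= O(m \log\log n + \log^2 n + occ \log n).
\end{align*}
```

The $\log n$ term of the binary search is absorbed by the $\log^2 n$ term.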

The average case analysis was done in earlier work under the assumption that
all the areas in the suffix array are equally likely to be linked (note that
this assumption follows from assuming a uniform Bernoulli model, but is more
general). With this assumption, the average values for the length of the
linked areas and for the
length of the link paths are both , where n' is the size of the CSA. Now 
we have 0(m log +logn+ ( ^”[[1” )^ + occ ^"~," ) search time for reporting 

occ > 0 occurrences (set occ = 0 for existence queries) . 

5 Space Requirement 

5.1 Minimal CSA 

An entry of the CSA consists of three integers and one boolean value. Unless
the text is very compressible, the CSA would occupy more space than the suffix
array. To guarantee a limit for the search times, we already introduced
constants $C$ and $D$ to limit the length of linked areas and link paths. It
follows that values $\Delta x$ and $\Delta$ can be represented in
$\lceil \log(C+1) \rceil$ bits. Also value $x$ can be represented in
$\lceil \log n \rceil$ bits. Therefore the CSA can be stored as four bit
vectors of sizes $n'$ (values $B$), $n' \lceil \log n \rceil$ (values $x$),
and $n' \lceil \log(C+1) \rceil$ (values $\Delta x$ and $\Delta$), while
access to each entry is guaranteed in constant time using bit operations.
If the LCP values are not needed, fields $\Delta x$ and $\Delta$ are unused in
the entries containing suffixes, and we can use the same array to store both
values (if an entry $i$ contains a link, then the entry $i + 1$ contains a
suffix, and we can store value $\Delta x$ in the entry $i$ and value $\Delta$
in the entry $i + 1$).
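
The resulting space usage can be compared with that of a plain suffix array using a few lines of Python. The sketch assumes the variant without LCP values, in which the fields for the link offset and link length share one array; the function names and the example sizes are hypothetical and not taken from the experiments.

```python
def ceil_log2(k):
    """Return ceil(log2(k)) for an integer k >= 1."""
    return (k - 1).bit_length()


def csa_size_bits(n, n_prime, c):
    """Bits used by a CSA with n_prime entries over a text of length n.

    One bit per entry for B, ceil(log n) bits for x, and one shared
    field of ceil(log(C + 1)) bits for the link offset or link length.
    """
    return n_prime * (1 + ceil_log2(n) + ceil_log2(c + 1))


def suffix_array_size_bits(n):
    """Bits used by a plain suffix array of 32-bit integers."""
    return 32 * n


# Hypothetical sizes, chosen only to illustrate the comparison.
n, n_prime, c = 1_000_000, 600_000, 31
csa_bits = csa_size_bits(n, n_prime, c)
sa_bits = suffix_array_size_bits(n)
```

With these made-up numbers each CSA entry takes 26 bits, so the structure is smaller than the suffix array even though a naive entry of three integers and a flag would be larger.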

To get the most space-efficient CSA (i.e. the minimal CSA), we can simulate
the construction algorithm with values
$C \in \{2^0 - 1, 2^1 - 1, \ldots, 2^{\lceil \log n \rceil} - 1\}$
($D = \infty$) and choose the value that results in the minimal CSA.
Therefore we can construct the minimal CSA in $O(n \log n)$ time. A minimal
CSA takes at most the same space as a suffix array, because with value
$C = 0$ the minimal CSA can be represented as a suffix array.

5.2 Reducing the Space Requirement of the Construction 

A straightforward implementation of the construction algorithm in Sect. 3 takes
16n bytes. We use a technique similar to the one used for calculating the LCP
values in Fig. 3 to reduce the space requirement of the construction; the
same array is used for two purposes. At the beginning, the array stores the
reverse of the suffix array to provide constant-time access for locating
suffixes. At the end, the same array contains the suffix array, in which the
linked areas are marked. The same technique is used again to compact the
marked array.

The optimal CSA can be constructed using only 8n bytes. To construct the
minimal CSA or a CSA that provides guaranteed worst case limits for search
times, we have to include values $C$ and $D$ in the construction algorithm.

Limiting the lengths of the linked areas with value $C$ is easy; we can just
break long linked areas into smaller parts. Limiting the lengths of the link
paths is more complicated, but it can be done by introducing a new array
$pathdepth(1 \ldots n)$ to store path depths for each suffix. Updating the
array pathdepth can be done in constant time by storing the path depths at
both the beginning and the end of the link paths. When two paths are
combined, the length of the combined paths can be checked and updated in
constant time.

Because the use of the value $D$ is only worthwhile when the value is small,
we can assume that $D$ fits into a byte. Therefore the whole construction uses
9n bytes. When LCP values are needed, the construction can still be done using
$13\frac{1}{8}n$ bytes, by using the algorithms presented in Sect. 4.2.

6 Experimental Results 

We have implemented all the methods discussed in this paper. For testing
purposes it would have been enough to simulate most of the algorithms using
easier implementations, but as we wanted to test with large datasets, the more
space-efficient implementations became necessary. However, the space-efficient
implementations did have an impact on the running times; in particular, the
use of arrays with bit fields of a given length (other than a byte, etc.)
slowed down the processing by a factor of circa 1.5.

The test program was written in C++, and compiled with g++ version
2.7.2.3 with option -O3 in a Linux 2.0.38 environment. The tests were run on
a 700 MHz Pentium III machine with 768 MB RAM. The running times were
measured in milliseconds using the times library function.

We used a collection of text files of different sizes and a DNA sequence
(Human chromosome 7) in our experiments (see Table 1). The DNA sequence
(68 MB) was almost the largest file that could be indexed on our machine; its
compact suffix array could be constructed in RAM, but for the LCP values we
had to use the trivial $O(n \log n)$ expected case ($O(n^2)$ worst case)
algorithm, because the $O(n)$ algorithm required too much space
($13\frac{1}{8} \cdot 68$ MB $>$ 768 MB).

There are basically three parameters that control the efficiency of the CSA
index structure. Values $C$ (maximum length of a linked area) and $D$
(maximum length of a link path), and the use of the LCP values, all affect
both the compression and the search times.

We first examine the effect of the values C and D. We ran two tests varying 
first the values of C and then D on text txtcorpus. The results are shown in Fig. 
4. Instead of measuring time, we calculated the amount of operations that were 

Table 1. The files that were used in tests are listed below. Some files (marked as “c”
in the corpus column) are from the Canterbury and Calgary corpora
(http://corpus.canterbury.ac.nz), and others (marked as “g”) are from Project
Gutenberg (http://www.promo.net/pg). To simulate a large text file, we
concatenated the files marked as “1” into a file named txtcorpus. The other
large file is a DNA sequence of Human chromosome 7 (the .lbr at the end of
07hgp10.txt means that line breaks are removed from the original file).

text             size        corpus  type  0/1
alice29.txt         152 089  c       text  1
bible11.txt       5 073 934  g       text  1
book1               768 771  c       text  0
crsto10.txt       2 751 761  g       text  1
kjv10.txt         4 545 377  g       text  1
lcet10.txt          426 754  c       text  1
nkrnn09.txt       2 061 602  g       text  1
nkrnn10.txt       2 062 563  g       text  1
nkrnn11.txt       2 075 075  g       text  1
paper1               53 161  c       text  0
pge0112.txt       8 588 144  g       text  1
plrabn12.txt        481 861  c       text  1
progp                49 379  c       code  0
shaks12.txt       5 582 655  g       text  1
suntx10.txt          79 787  g       text  1
sunzu10.txt         344 157  g       text  1
taofj10.txt       3 059 101  g       text  1
txtcorpus        43 930 831  -       text  0
world192.txt      2 473 400  c       text  1
07hgp10.txt.lbr  71 003 068  g       DNA   0



needed to uncompact the whole array. It can be seen that both parameters
affect the uncompacting time considerably, although the behavior is not as bad
as what could be expected from the worst case analysis. On the other hand, the
average case analysis is too optimistic, because it is based on the assumption
that each entry is equally likely to be the start of a linked area, which is
not true for usual texts. Typically some areas of the text are more
compressible than others (like white space sequences), and therefore there
are areas in a compact suffix array where uncompacting is closer to the worst
case behavior.

The compression was not improved significantly as C and D grew except 
when they were small. Usually the minimal CSA is achieved with small C (be- 
tween 7 and 63 in our test set), and therefore we used minimal CSAs in the 
remaining tests, and controlled the trade off between compression and search 
times by choosing an appropriate value for D. 

Next we studied the effect of the LCP values. We used three sets of patterns: 
small (4-10), moderate (30-50), and large (200-500). Each set consisted of 10000 
patterns, selected randomly from the files in Table 1 (separate sets for text files 
and DNA). The results are shown in Table 2. 

The LCP values did not provide any significant speedup in our tests. In
fact, Table 2 shows that searching was usually slower with LCP values; the more
complicated implementation seems to overrule the effect of LCP values, espe- 
cially because accessing the array fields was much more costly than in standard 
array implementation. Similar behaviour dominated also the searches from the 
suffix array. To really see the effect of LCP values, we tested also with such an 

^1 Actually, the whole array could be uncompacted in linear time if needed; we can
store the suffixes represented by a link in their places in a suffix array the first
time we uncompact that link. Before uncompacting a link, we can check whether it has
already been uncompacted. Thus, each suffix can be uncompacted in constant time.
Fig. 4. The effect of values C (left) and D (right) on uncompacting the CSA of file
txtcorpus. In the left, the values of C are 2^ — 1, 2® — 1, , 2^® — 1, while D = 256. In
the right, the values of D are 2^, 2^, ... , 2®, while C = 31. The worst case behaviour
was calculated with formula C^{D — log C)l + s, where l is the amount of links in the
CSA and s is the amount of suffixes in the CSA (i.e. l + s = n'). The average case
behaviour was calculated similarly.

Table 2. The effect of the LCP values. We measured the overall time to search (not 
report) 10000 patterns averaged over 100 trials for three sets consisting of different size 
patterns. Three different versions of the CSA were used: CSA is without LCP values, 
CSA_LCP is with LCP values, and CSA_LCPfit is with LCP values that fit into the 
space of fields Ax and A. 





                 Parameters                    Search times (ms)
text             |pattern|    C    D     CSA   CSA_LCP   CSA_LCPfit
txtcorpus        4-10        31   12     107       130          136
txtcorpus        30-50       31   12     164       180          188
txtcorpus        200-500     31   12     224       234          242
07hgp10.txt.lbr  4-10         7   12      79        96           92
07hgp10.txt.lbr  30-50        7   12     116       121          116
07hgp10.txt.lbr  200-500      7   12     187       176          168



extreme text as aioooooo pattern where searching was 19 times faster 

with LCP values in suffix array. 

Finally, we compared CSA with suffix array and sequential search. In sequential
search, we used the implementation of the Boyer-Moore-Horspool algorithm [7]
from [4] (with a simple modification to report all the matches). We used the
algorithm of Larsson and Sadakane [10] to construct suffix arrays. To guarantee a
fair comparison, we represented suffix arrays in n⌈log n⌉ bits instead of the usual
4n bytes implementation. We did not use LCP values in comparison, because
they did not provide any speedup in practice. The results are shown in Table 3.
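As an illustration of the sequential baseline, the following is a minimal Python sketch of Boyer-Moore-Horspool search that reports every occurrence. It is not the handbook implementation used in the experiments; the function name `bmh_find_all` and the dictionary-based shift table are assumptions made for this example.

```python
def bmh_find_all(text, pattern):
    """Return the start positions of all occurrences of pattern in text."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    # Shift table: for each character of the pattern except the last one,
    # the distance from its rightmost occurrence to the end of the pattern.
    shift = {c: m - 1 - i for i, c in enumerate(pattern[:-1])}
    matches = []
    pos = 0
    while pos <= n - m:
        if text[pos:pos + m] == pattern:
            matches.append(pos)
        # Shift according to the text character aligned with the last
        # pattern position; unseen characters allow a full shift of m.
        pos += shift.get(text[pos + m - 1], m)
    return matches


print(bmh_find_all("abracadabra", "abra"))  # [0, 7]
```

Because the search continues after each match instead of stopping, overlapping occurrences are reported as well.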

The results in Table 3 show that compact suffix arrays take much less space 
than suffix arrays, and the penalty in search times is acceptable if there are not 
too many occurrences to report. The removal of the heuristic to avoid cyclic links
is crucial (compare the values in column |CSA1|). The gap between sequential
search and indexed search is huge, in favor of the index. 
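The space figure for a suffix array packed into n⌈log n⌉ bits can be checked with a few lines of Python. This is a sketch under the assumption that the logarithm is base 2 and that index size is reported in bytes per text character; the function name `packed_sa_ratio` is hypothetical.

```python
def packed_sa_ratio(n):
    """Index size / text size for a suffix array stored in n*ceil(log2 n) bits.

    Assumes one byte per text character and n >= 2.
    """
    bits_per_entry = (n - 1).bit_length()  # equals ceil(log2 n) for n >= 2
    return bits_per_entry / 8


# File sizes taken from the list of test files.
for name, size in [("book1", 768771), ("paper1", 53161), ("txtcorpus", 43930831)]:
    print(name, packed_sa_ratio(size))
```

For book1, paper1 and txtcorpus this gives 2.5, 2.0 and 3.25 bytes per character, which agrees with the |SA| column of Table 3.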
Table 3. The comparison of search times and space requirement. We measured the
overall time to search and report 10000 patterns averaged over 100 trials (except only
one trial for sequential search). Small patterns were used for texts, and moderate for
DNA. Column occ denotes either the amount of patterns found (upper line on each
text) or the amount of occurrences (lower line on each text). CSA1 denotes the minimal
CSA with D = 12, CSA2 denotes CSA with D = ⌈log⌈log n⌉⌉ and C = 2^ — 1, SA
denotes the suffix array, and BMH denotes the Boyer-Moore-Horspool search. The
values in parentheses in column |CSA1| denote the space requirement when the cycles
were prevented with the heuristic of [13].

                 Index size / text size              Search times (ms)
text             |CSA1|   |CSA2|   |SA|      occ    CSA1   CSA2    SA       BMH
book1            1.71     1.83     2.50      996      65     66    36     43020
book1            (2.13)                     5176      67     68    36     46760
paper1           1.29     1.46     2.00      152      52     46    25      1340
paper1           (1.69)                      346      51     47    25      1550
pge0112.txt      1.75     1.90     3.00     6173      85     80    48    312580
pge0112.txt      (2.36)                    98557     133    116    49    520300
progp            1.01     1.26     2.00       14      54     44    23      1190
progp            (1.38)                       34      54     44    24      1370
txtcorpus        1.59     1.84     3.25    10000     106     91    56   1384570
txtcorpus        (2.30)                   351354     301    212    60   2647240
world192.txt     1.15     1.41     2.75     1700     131     85    42    130850
world192.txt     (1.81)                    35368     150     96    42    148160
07hgp10.txt.lbr  2.68     2.87     3.38    10000     115    118    91   4038640
07hgp10.txt.lbr  (2.98)                    14542     124    124    91   8571530



7 Discussion 

The large space requirement of full-text indexes is often the reason why they 
cannot be used in applications in which they would otherwise be a natural solu- 
tion. We have shown that compact suffix arrays can occupy less than two times 
the size of the text with search times comparable to those of suffix arrays. Also, 
compact suffix arrays can be constructed asymptotically as fast as suffix arrays, 
with similar space requirements at construction time. 

Still, compact suffix arrays do take more space than the text itself (the worst
case is O(n log n) bits as in suffix arrays), making them too expensive in many
applications. There is also recent work considering full-text indexes that occupy
O(n) bits [6], or even less [1]. They also have some penalty on search times,
but the experimental results [2] considering the opportunistic index of [1] show
that the index is quite competitive; an indirect comparison to compact suffix
array with text world192.txt (through comparing relative performance to suffix
array) shows that counting the amount of pattern occurrences is equally fast,
and reporting the occurrences is 10 times slower than from compact suffix array,
while the opportunistic index takes 7 times less space than the compact suffix
array.






References 

1. P. Ferragina and G. Manzini, Opportunistic Data Structures with Applications, In 
Proc. IEEE Symposium on Foundations of Computer Science, 2000. 

2. P. Ferragina and G. Manzini, An Experimental Study of an Opportunistic Index, 
In Proc. ACM-SIAM Symposium on Discrete Algorithms, 2001, to appear. 

3. R. Giegerich, S. Kurtz, and J. Stoye, Efficient Implementation of Lazy Suffix Trees, 
In Proc. Third Workshop on Algorithmic Engineering (WAE99), LNCS 1668, 
Springer Verlag, 1999, pp. 30-42. 

4. G. H. Gonnet and R. Baeza-Yates, Handbook of Algorithms and Data Structures -
In Pascal and C, Addison-Wesley, Wokingham, UK, 1991. (second edition).

5. G. H. Gonnet, R. A. Baeza-Yates, and T. Snider, Lexicographical indices for text:
Inverted files vs. PAT trees, Technical Report OED-91-01, Centre for the New
OED, University of Waterloo, 1991.

6. R. Grossi and J. Vitter, Compressed suffix arrays and suffix trees with applications
to text indexing and string matching, In Proc. 32nd ACM Symposium on Theory
of Computing, 2000, pp. 397-406.

7. R. N. Horspool, Practical fast searching in strings, Soft. Pract. and Exp., 10, 1980,
pp. 501-506.

8. T. Kasai, H. Arimura, and S. Arikawa, Virtual suffix trees: Fast computation of
subword frequency using suffix arrays, In Proc. 1999 Winter LA Symposium, 1999,
in Japanese.

9. J. Kärkkäinen, Repetition-Based Text Indexes, PhD Thesis, Report A-1999-4,
Department of Computer Science, University of Helsinki, Finland, 1999.

10. N. Jesper Larsson and K. Sadakane, Faster Suffix Sorting, Technical Report,
number LU-CS-TR:99-214, Department of Computer Science, Lund University,
Sweden, 1999.

11. U. Manber and G. Myers, Suffix arrays: A new method for on-line string searches,
SIAM J. Comput., 22, 1993, pp. 935-948.

12. E. M. McCreight, A space economical suffix tree construction algorithm, Journal
of the ACM, 23, 1976, pp. 262-272.

13. V. Mäkinen, Compact Suffix Array, In Proc. 11th Annual Symposium on
Combinatorial Pattern Matching (CPM 2000), LNCS 1848, 2000, pp. 305-319.

14. E. Ukkonen, On-line construction of suffix-trees, Algorithmica, 14, 1995, pp. 249- 
260. 

15. P. Weiner, Linear pattern matching algorithms. In Proc. IEEE 14th Annual Sym- 
posium on Switching and Automata Theory, 1973, pp. 1-11. 




Implementation of a PTAS 
for Scheduling with Release Dates* 



Clint Hepner and Cliff Stein 

Dartmouth College 
{chepner, cliff}@cs.dartmouth.edu



Abstract. The problem of scheduling jobs with release dates on a single
machine so as to minimize total completion time has long been known
to be strongly NP-complete. Recently, a polynomial-time approximation 
scheme (PTAS) was found for this problem. We implemented this algo- 
rithm to compare its performance with several other known algorithms 
for this problem. We also developed several good algorithms based on 
this PTAS that run faster by sacrificing the performance guarantee. Our 
results indicate that the ideas used by this PTAS lead to improved algo- 
rithms. 



1 Introduction 

Scheduling n jobs with release dates on a single machine is a heavily-studied
basic scheduling problem ii7i,4iRmq . In this paper, we study the objective of
minimizing the total completion time in this model. Since this problem is
strongly NP-complete, much research in recent years has focused on finding
heuristics and approximation algorithms. Several approximation algorithms
which find schedules whose objective value is within a small constant factor of
optimal have been discovered; see |MTTj for details. A polynomial-time
approximation scheme (PTAS) is an approximation algorithm which finds solutions
within a (1 + ε)-factor of optimal for any fixed ε > 0. Recently, a PTAS was
designed for this problem p.

Usually, the running time of a PTAS is so large that implementing the algorithm
is not feasible. For example, the running time is commonly polynomial
in n but with 1/ε in the exponent (n^{O(1/ε)}), or the constant factor is a function
of ε, as in O(f(ε) n^c). For this reason, it is rare that a PTAS yields an efficient
implementation of an algorithm. A fully polynomial-time approximation scheme
(FPTAS) has a running time that is polynomial in n and 1/ε. For example, an
FPTAS for the KNAPSACK problem runs in O(n lg(1/ε) + 1/ε^4) time PH]. Such
algorithms have a “nice” dependence on ε and run quickly. The scheduling PTAS
discussed in this paper has a running time of O(n lg n + (1/ε^5)!). Although the
constant term is large, the fact that it is not coupled with the n lg n term makes
it plausible that the algorithm can be implemented to run quickly.

* Research partially supported by NSF Career Award CCR-9624828, NSF Grant
EIA-98-02068, NSF Grant DMI-9970063 and an Alfred P. Sloan Foundation Fellowship.
In this paper, we describe an implementation of the PTAS by Karger and
Stein, which we call S-PTAS. We implement several other exact, heuristic,
and approximation algorithms, and compare the performance of the PTAS
to these other algorithms. We also try to find better algorithms based on the 
ideas developed in S-PTAS. Given the limited practical success of implement- 
ing approximation schemes, we look for lessons in our implementation that may 
be applicable to other approximation schemes. Our results show that although 
S-PTAS, when implemented strictly, does not perform very well, it is possible to 
modify the implementation to develop strong heuristic algorithms that perform 
well in practice. Our method for doing so focuses on the impact of rounding and 
enumeration and thus may be applicable to approximation schemes for other 
problems as well. 

2 Terminology 

Let J be a set of n jobs. For each job j = 1, 2, ..., n, the release date and
processing time are r_j and p_j, respectively. We also refer to the processing time
as the size of the job. For the one-machine scheduling problem discussed in this
paper, a schedule is an assignment of jobs to time slots on the machine so that
no job begins processing before its release date and at most one job is being
processed at any time. Job j must receive p_j units of processing. The completion
time of job j in a schedule σ is denoted C_j(σ). When the schedule is clear from
context, start and completion times are abbreviated S_j and C_j, respectively.

Usually, each job j must run non-preemptively (without interruption), so that
C_j = S_j + p_j. If the problem specifies a preemptive schedule, scheduled jobs may
be interrupted and resumed at a later time, so that C_j ≥ S_j + p_j. For the rest of
the paper, we discuss non-preemptive schedules unless otherwise stated. Some
specific schedules are denoted by lowercase Greek letters. Optimal schedules are
denoted by π. Optimal preemptive schedules are denoted by φ. For any schedule
σ, we abbreviate the sum Σ_{j∈J} C_j(σ) as C(σ), and we refer to this value as
the total completion time (or TCT) of the schedule. The total completion time
of the optimal schedule is abbreviated as OPT. In this paper, the problem we
study is to minimize the total completion time of jobs with release dates on one
machine.

To simplify the descriptions of some of the algorithms, we use the following
notation from Chu 0. Let R_j(Δ) = max{r_j, Δ} be the earliest time job j can
begin if it hasn’t been scheduled before time Δ. Similarly, let E_j(Δ) = R_j(Δ) + p_j
be the earliest time a job can complete if it hasn’t been scheduled before time Δ.
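The notation above translates directly into code. The following Python sketch evaluates the earliest start and completion times and the total completion time of a non-preemptive job order; the function names and the representation of a job as a (release date, processing time) pair are assumptions made for illustration.

```python
def earliest_start(release, delta):
    """Earliest time a job can begin if it is unscheduled before delta."""
    return max(release, delta)


def earliest_completion(release, processing, delta):
    """Earliest time a job can complete if it is unscheduled before delta."""
    return earliest_start(release, delta) + processing


def total_completion_time(jobs, order):
    """TCT of the non-preemptive schedule that runs jobs in the given order.

    jobs: list of (release date, processing time) pairs.
    order: sequence of job indices; each job starts as early as possible.
    """
    t = 0
    total = 0
    for j in order:
        release, processing = jobs[j]
        t = earliest_completion(release, processing, t)
        total += t
    return total


jobs = [(0, 3), (1, 1), (2, 2)]
print(total_completion_time(jobs, [0, 1, 2]))  # 3 + 4 + 6 = 13
print(total_completion_time(jobs, [1, 0, 2]))  # 2 + 5 + 7 = 14
```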

Since our problem is strongly NP-complete H21, there are several ways to 
approach it. Exact algorithms provide the optimal schedule, but they neces- 
sarily run slowly on some inputs. Heuristic algorithms utilize simple rules for 
selecting jobs. They run quickly but without a bound on how far the schedule 
could be from optimal. Approximation algorithms fall between the two. A
ρ-approximation algorithm is a polynomial-time algorithm that returns a schedule
σ such that C(σ)/C(π) ≤ ρ. In this paper, we study a polynomial-time
approximation scheme (PTAS), which is a (1 + ε)-approximation algorithm for any fixed
ε > 0. The running time of a PTAS, while polynomial in the size of the input,
is not necessarily polynomial in 1/ε.



3 Algorithms 

We implement S-PTAS and compare its performance to some heuristic algo- 
rithms and another approximation algorithm. First, we describe each algorithm 
we implemented. Afterwards, we analyze each in terms of running time and 
performance guarantees. 



3.1 Algorithm Descriptions 

The first two algorithms described are exact algorithms which give optimal so- 
lutions: one is for the stated problem, the other is for its preemptive relaxation. 
The second two are heuristics with no performance guarantees. The final al- 
gorithms are approximation algorithms: a small-constant-factor approximation 
algorithm and S-PTAS. 

BNB-Chu. This branch-and-bound algorithm was developed by Chu 0. The 
nodes in the search tree represent partial schedules. A node at depth i corre- 
sponds to a schedule of i jobs. Each child represents the addition of one of the 
n — i remaining jobs to the end of the schedule. The lower and upper bounds for 
a node are computed by the SRPT and APRTF algorithms, respectively (both 
are described below). The upper bound of the root node is found by taking the 
best schedule found among four different heuristics. To reduce the branching 
factor of the tree, nodes are only considered if an optimal schedule exists with 
that node’s partial schedule as a prefix. A branch is pruned if the lower bound 
of a node exceeds the best upper bound found so far. The optimal schedule is 
returned when a leaf node passes an optimality test, or when the entire search 
tree has been explored. 

SRPT. SRPT (shortest remaining processing time) gives the optimal pre- 
emptive schedule. For each job j, we maintain a variable Xj representing the 
remaining processing time. This value is initialized to pj when the job arrives 
at time rj. When the machine is available, the job with the smallest value of 
Xj is chosen. When a job k arrives, we compare it to the currently running job 
j. If Xk < Xj, we preempt job j and begin running job k. While a job runs, we 
update Xj to reflect the amount of processing done. 
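The SRPT rule can be sketched as an event-driven simulation with a priority queue keyed on remaining processing time. This Python version is an illustration only, not the C code used in the experiments; the function name `srpt`, the job representation, and the tie-breaking by release date are assumptions.

```python
import heapq


def srpt(jobs):
    """Simulate SRPT on one machine.

    jobs: list of (release date, processing time) pairs.
    Returns (completion, pieces): completion[j] is the completion time of
    job j, and pieces lists the (job, start, end) intervals that were run.
    """
    n = len(jobs)
    by_release = sorted(range(n), key=lambda j: jobs[j][0])
    heap = []  # entries: (remaining time, release date, job)
    completion = [None] * n
    pieces = []
    t, i = 0, 0
    while i < n or heap:
        if not heap:  # machine idle: jump to the next release date
            t = max(t, jobs[by_release[i]][0])
        while i < n and jobs[by_release[i]][0] <= t:
            j = by_release[i]
            heapq.heappush(heap, (jobs[j][1], jobs[j][0], j))
            i += 1
        x, r, j = heapq.heappop(heap)
        next_release = jobs[by_release[i]][0] if i < n else float("inf")
        run = min(x, next_release - t)  # run until done or next arrival
        if pieces and pieces[-1][0] == j and pieces[-1][2] == t:
            pieces[-1] = (j, pieces[-1][1], t + run)  # extend current piece
        elif run > 0:
            pieces.append((j, t, t + run))
        t += run
        if run == x:
            completion[j] = t
        else:
            heapq.heappush(heap, (x - run, r, j))  # possibly preempted
    return completion, pieces


completion, pieces = srpt([(0, 5), (1, 2), (2, 1)])
print(completion)  # [8, 3, 4]
print(pieces)      # [(0, 0, 1), (1, 1, 3), (2, 3, 4), (0, 4, 8)]
```

Keying ties on the release date keeps the running job on the machine when a new arrival has the same remaining time, which matches the strict comparison in the description above.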

SPT. SPT (shortest processing time) is a simple, well-known algorithm. When- 
ever the machine is available at time t, the algorithm chooses to run, from those 
jobs released by time t, the unscheduled job j with minimum pj. If no jobs are 
available, the algorithm waits until the next release date. 
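A minimal Python sketch of the SPT rule follows. The function name `spt`, the (release date, processing time) job representation, and tie-breaking by job index are assumptions for this example.

```python
import heapq


def spt(jobs):
    """Schedule jobs non-preemptively with the SPT rule.

    jobs: list of (release date, processing time) pairs.
    Returns a list of (job, completion time) in the order the jobs run.
    """
    n = len(jobs)
    by_release = sorted(range(n), key=lambda j: jobs[j][0])
    available = []  # heap of (processing time, job) for released jobs
    schedule = []
    t, i = 0, 0
    while i < n or available:
        if not available:  # nothing to run: wait for the next release
            t = max(t, jobs[by_release[i]][0])
        while i < n and jobs[by_release[i]][0] <= t:
            j = by_release[i]
            heapq.heappush(available, (jobs[j][1], j))
            i += 1
        p, j = heapq.heappop(available)  # shortest released job
        t += p
        schedule.append((j, t))
    return schedule


print(spt([(0, 5), (1, 2), (2, 1)]))  # [(0, 5), (2, 6), (1, 8)]
```

On this small instance SPT commits to the long job at time 0 because nothing else has been released yet, which is the behaviour that makes it a heuristic without a performance guarantee.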
APRTF. APRTF 0] is an extension of a simpler algorithm, PRTF. PRTF
always chooses at time t the job j that minimizes R_j(t) + E_j(t). APRTF considers
two jobs α and β, the jobs that would be chosen by PRTF and SPT respectively.
If α is the job with the earlier release date, APRTF schedules α. Otherwise,
APRTF chooses which job to run in the following way. Consider the effect of
running α first; this would be the optimal choice if only jobs α and β remained.
However, scheduling β first could reduce the completion times of future jobs.
APRTF estimates the total completion time of the schedule resulting from each
choice, and chooses to run the job that gives the smaller estimate.
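The simpler PRTF rule can be sketched in Python as below. Only PRTF is shown, since the estimate that APRTF uses to choose between its two candidates is not spelled out here. The function name `prtf`, the quadratic-time selection loop, and tie-breaking by job index are assumptions.

```python
def prtf(jobs):
    """Schedule jobs with the PRTF rule.

    jobs: list of (release date, processing time) pairs.
    At each decision time t, pick the unscheduled job that minimizes the
    sum of its earliest start and earliest completion time.
    Returns a list of (job, completion time) in the order the jobs run.
    """
    remaining = set(range(len(jobs)))
    schedule = []
    t = 0
    while remaining:
        def priority(j):
            release, processing = jobs[j]
            start = max(release, t)            # earliest start at time t
            return (start + (start + processing), j)

        j = min(remaining, key=priority)
        release, processing = jobs[j]
        t = max(release, t) + processing
        schedule.append((j, t))
        remaining.remove(j)
    return schedule


print(prtf([(0, 5), (1, 2), (2, 1)]))  # [(1, 3), (2, 4), (0, 9)]
```

Unlike SPT, the rule also considers jobs that have not been released yet, so it may leave the machine idle briefly in order to start a short job.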

Best-Alpha. Best-Alpha 0 schedules jobs based on the notion of an α-point.
Let 0 < α ≤ 1. For a given preemptive schedule, the α-point of a job is the time
by which it has completed an α-fraction of its processing. The corresponding
α-schedule is the non-preemptive schedule in which jobs are scheduled in order
of their α-points. Different values of α produce different α-schedules from the
same preemptive schedule. This algorithm starts with the optimal preemptive
schedule and finds the α-schedule with the smallest TCT over all values of α,
where 0 < α ≤ 1.
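The α-point construction can be sketched in Python as follows. The preemptive schedule is assumed to be given as a list of (job, start, end) pieces with integer data, each job fully processed; the function names and this input format are assumptions for illustration, and exact fractions are used to avoid rounding issues.

```python
from fractions import Fraction


def alpha_schedule_tct(jobs, pieces, alpha):
    """TCT of the alpha-schedule derived from a preemptive schedule.

    jobs: list of (release date, processing time) pairs.
    pieces: list of (job, start, end) intervals of the preemptive schedule.
    """
    n = len(jobs)
    done = [0] * n
    point = [None] * n  # alpha-point of each job
    for j, start, end in pieces:
        target = alpha * jobs[j][1]
        if point[j] is None and done[j] + (end - start) >= target:
            point[j] = start + (target - done[j])
        done[j] += end - start
    order = sorted(range(n), key=lambda j: (point[j], j))
    t = total = 0
    for j in order:  # run jobs non-preemptively in alpha-point order
        release, processing = jobs[j]
        t = max(t, release) + processing
        total += t
    return total


def best_alpha(jobs, pieces):
    """Try the completed fraction at every preemption, plus alpha = 1."""
    done = [0] * len(jobs)
    candidates = {Fraction(1)}
    for j, start, end in pieces:
        done[j] += end - start
        if done[j] < jobs[j][1]:  # job j was preempted at this point
            candidates.add(Fraction(done[j], jobs[j][1]))
    return min((alpha_schedule_tct(jobs, pieces, a), a) for a in candidates)


jobs = [(0, 5), (1, 2), (2, 1)]
pieces = [(0, 0, 1), (1, 1, 3), (2, 3, 4), (0, 4, 8)]
print(best_alpha(jobs, pieces))  # (16, Fraction(1, 1))
```

In this instance α = 1/5 orders the jobs 0, 1, 2 for a TCT of 20, while α = 1 orders them 1, 2, 0 for a TCT of 16.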

S-PTAS. We present here the PTAS from |P, which we call S-PTAS. For later
reference, we present it as pseudo-code. To simplify the notation, we define
ROUND(x, ε) = (1 + ε)^⌈log_{1+ε} x⌉ as the operator that rounds x up to the nearest
integer power of 1 + ε.

The input to S-PTAS is a set of jobs J = {1, ..., n}, with a release date r_j
and a processing time p_j for each job j; and a real value ε > 0.

1: For each j ∈ J: r'_j ← ROUND(r_j, ε), p'_j ← ROUND(p_j, ε).

2: For each j ∈ J: r'_j ← max{r'_j, ε p'_j}.

3: For each j ∈ J: r''_j ← max{r'_j, e/p' ^}.

4: Schedule jobs with the SPT algorithm, using r''_j for release dates and p'_j for
processing times, until 1/ε^5 jobs remain. Let J' be these remaining jobs, with
r'_j and p'_j as release dates and processing times. Denote by σ' the schedule
of the jobs J − J'.

5: Enumerate over all schedules which have σ' as a prefix, followed by the jobs
in J'. Return the schedule with minimum TCT.

Smaller values of ε lead to better schedules but slower running times.
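The rounding operator and the first two steps of the pseudo-code can be sketched in Python. This is an illustration rather than the authors' C++ code: the function names are hypothetical, a release date of 0 is left at 0 before step 2 (an assumption, since it has no power of 1 + ε to round to), processing times are assumed positive, and the later steps are not reproduced.

```python
import math


def round_up_power(x, eps):
    """Smallest integer power of (1 + eps) that is >= x, for x > 0."""
    k = math.ceil(math.log(x) / math.log1p(eps))
    # Guard against floating-point error in the logarithm.
    while (1 + eps) ** k < x:
        k += 1
    while (1 + eps) ** (k - 1) >= x:
        k -= 1
    return (1 + eps) ** k


def preprocess(jobs, eps):
    """Apply steps 1 and 2: round both values, then delay short releases.

    jobs: list of (release date, processing time) pairs with p > 0.
    Returns the list of modified (release date, processing time) pairs.
    """
    result = []
    for release, processing in jobs:
        r1 = round_up_power(release, eps) if release > 0 else 0.0
        p1 = round_up_power(processing, eps)
        result.append((max(r1, eps * p1), p1))
    return result


print(round_up_power(3, 1.0))  # 4.0
print(preprocess([(0, 10), (100, 1)], 0.5))
```

Rounding up can only delay a job, which is why the analysis charges each of these steps a factor of at most 1 + ε.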



3.2 Algorithm Analysis 

In this section, we provide the running time and further discussion on each 
algorithm. For each heuristic algorithm, we provide an instance for which the 
algorithm gives a solution whose cost is an Ω(n) factor from optimal. For the
approximation algorithms, we discuss the performance bound. 

BNB-Chu. We use Chu’s algorithm to compute optimal schedules for most of 
the instances for the tests in this paper. Chu showed in |S| that his algorithm was 
both faster and more space-efficient than other exact algorithms. His algorithm
finds optimal schedules for instances of up to 100 jobs within a few minutes. The 
running time is exponential in the number of jobs, with n! possible nodes in the 
search tree. In practice, the running time is greatly reduced by the dominance 
properties and the tight bounds provided by SRPT and APRTF but remains 
exponential in the worst case. 

SRPT. For the non-preemptive scheduling problem, SRPT is used to generate
lower bounds on OPT. The optimal preemptive schedule also provides a starting
point for Best-Alpha. The running time is O(n lg n).

SPT. The shortest processing time (SPT) algorithm is one of the simplest
algorithms for the non-preemptive version of our scheduling problem. Although
SPT performs well on many instances, it can give a schedule with a total
completion time as large as Ω(n) times the optimal. Consider an instance with a
large job of size B released at time 0, followed by n − 1 jobs of size 1 released at
time 1. SPT will schedule the large job first, giving a TCT of nB + (n choose 2), while
the optimal schedule waits until time 1 to run all the small jobs first for a TCT
of n + B + (n choose 2). For B ≫ n, the ratio C(σ)/C(π) approaches n. The running
time is O(n lg n).
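The bad instance for SPT can be checked numerically. The Python sketch below builds the instance and compares the order SPT produces (large job first, since it is the only job available at time 0) with the order that waits for the small jobs; the function names and the chosen values of n and B are assumptions for this example.

```python
def total_completion_time(jobs, order):
    """TCT when jobs run non-preemptively in the given order."""
    t = total = 0
    for j in order:
        release, processing = jobs[j]
        t = max(t, release) + processing
        total += t
    return total


def bad_instance(n, big):
    """One job of size big released at 0, then n - 1 unit jobs released at 1."""
    return [(0, big)] + [(1, 1)] * (n - 1)


n, big = 10, 10**6
jobs = bad_instance(n, big)
spt_cost = total_completion_time(jobs, range(n))               # large job first
wait_cost = total_completion_time(jobs, list(range(1, n)) + [0])  # small jobs first
print(spt_cost, wait_cost, spt_cost / wait_cost)
```

With n = 10 and B = 10^6 the ratio is just under 10, illustrating that the gap grows linearly in n once B dominates.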

APRTF. This algorithm, introduced by Chu in P| and used to generate upper 
bounds in his branch-and-bound algorithm, produces good schedules in practice. 
However, this algorithm can also give a schedule with a total completion time 
as large as Ω(n) times the optimal schedule. An instance with a large job of
size released at time 0, and many small jobs of size 1 released at time n — 1 
achieves this bound; we omit the details here. The running time of the algorithm 
is O(nlgn), but it is slightly slower than SPT due to the extra work performed 
in choosing a job to run. 

Best-Alpha. The idea of an α-schedule comes from 0. Since a preemptive schedule
has at most n − 1 preemptions, there are at most n values of α (including
α = 1) that produce unique α-schedules. Since these n values can be found in
linear time from the optimal preemptive schedule, Best-Alpha can generate all
of the unique α-schedules and return the best one in O(n^2 lg n) time. It is shown
in PI that Best-Alpha is a 1.58-approximation algorithm.

S-PTAS. In polynomial-time approximation schemes are given for many 
versions of the problem of minimizing total weighted completion time in the 
presence of release dates. The algorithm studied here, developed by Karger and 
Stein, is the PTAS for the one-machine release date scheduling problem. 

We give a quick overview of the analysis of S-PTAS; a detailed analysis
is in p. The analysis is best understood by thinking of S-PTAS as a series
of transformations applied to a given optimal schedule π. Referring back to
the pseudo-code for S-PTAS, applying step 1 and step 2 to the jobs increases
the cost of the optimal schedule by at most a factor of 1 + ε each. Rescheduling
the first n − 1/ε^5 jobs in SPT order also increases the cost of the optimal
schedule by at most a 1 + ε factor. The remaining jobs can be scheduled optimally
to find the best resulting
schedule. S-PTAS can thus find this modified optimal schedule by modifying the 
job values, scheduling some jobs in SPT order, then using an optimal algorithm 
to finish scheduling a constant number of jobs. 

We will further discuss the effects of rounding the job values later in the 
paper. 

4 Implementation of Algorithms 

In this section, we discuss how we implemented each of the algorithms for our 
tests. 

4.1 Existing Algorithms 

The implementations for the branch-and-bound algorithm, SRPT, and APRTF 
are taken from code provided by Chengbin Chu. All are coded in C. 

We wrote the code for finding α-schedules in C. The implementation takes
a rational number α, then scans the preemptive schedule φ provided by SRPT.
Each time a piece of a job finishes in φ, the total fraction of that job that has
completed is determined. When the total completed fraction of a job reaches α,
that job is added to the α-schedule. Best-Alpha finds the set of effective values
for α by determining the fraction that each preempted job has completed at
the time it is preempted. This set also includes α = 1.0.

4.2 S-PTAS 

We implemented S-PTAS in C++ using the standard library. Although C++
often generates slightly slower code than C, the most time-consuming portion
of S-PTAS uses the C implementation of Chu’s algorithm as a subroutine. The
C++ sections of the code only do file I/O and a fast SPT implementation,
so the running times can be compared to the other algorithms. We implement
the algorithm with two additional heuristics. Neither heuristic violates the 1 + ε
performance guarantee; the first can decrease the running time, while the second
improves the schedule quality.

The first heuristic gives an improved test for deciding when to stop scheduling 
in step 4. The original algorithm PJ stops the SPT scheduling step when the 
makespan reaches a threshold of T = e^C'(Tr). The algorithm can’t know T, 
but n — 1/e® is a lower bound (from P^) on the number of jobs that can be 
scheduled before time T. Since T' = e^C{cj>) is also a lower bound on T, we 
allow the algorithm to continue scheduling with SPT if we haven’t reached T' 
with the first n — 1/e® jobs. In either case, we then enumerate over all possible 
schedules which begin with this partial schedule by increasing the release dates 
of the remaining jobs to the makespan of the partial schedule, then running 
Chu’s algorithm on those jobs. 
The second heuristic computes the TCT of the schedule using the original
release dates and processing times. The performance guarantee is based on the
inflated TCT, so recomputing the true TCT with the smaller r_j and p_j can only
lead to better schedules.

5 Experiments 

We ran several tests to examine the effectiveness of S-PTAS compared to the 
existing algorithms. We compare S-PTAS against APRTF and Best- Alpha, and 
we look at the effect of rounding on the speed and accuracy of S-PTAS. 

First, we describe how we generated the data for each test. Then, we will 
describe the tests in detail. 

5.1 Data 

We generate test instances using the approach of Hariri and Potts. For each
job instance, release dates are chosen uniformly from the interval [0, E[p]nλ].
This simulates the arrival of a series of n jobs into a stable queue
according to a Poisson process with parameter λ. E[p] is the expected
processing time of each job, dependent on the current distribution of p_j. For all
instances, we create n = 150 jobs. We create five instances for each value of λ
from {0.2, 0.4, 0.6, 0.8, 1.0, 1.25, 1.5}. For the tests involving S-PTAS, we chose
processing times uniformly from [0, 100].
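An instance generator along these lines might look as follows in Python. This is a sketch, not the generator used for the experiments: the function name `generate_instance`, the use of real-valued (rather than integer) samples, and the explicit random generator argument are assumptions.

```python
import random


def generate_instance(n, lam, rng, p_max=100.0):
    """Generate n jobs as (release date, processing time) pairs.

    Processing times are uniform on [0, p_max], so E[p] = p_max / 2.
    Release dates are uniform on [0, E[p] * n * lam].
    """
    expected_p = p_max / 2
    horizon = expected_p * n * lam
    return [(rng.uniform(0, horizon), rng.uniform(0, p_max)) for _ in range(n)]


rng = random.Random(2001)  # fixed seed so the instance is reproducible
for lam in (0.2, 0.4, 0.6, 0.8, 1.0, 1.25, 1.5):
    jobs = generate_instance(150, lam, rng)
    print(lam, len(jobs), max(r for r, _ in jobs))
```

Small values of the parameter compress all releases into a short window, so most jobs are available early; large values spread the releases out and leave the machine idle more often.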

We performed some tests on instances larger than 150 jobs, using the SRPT 
schedules for comparison instead of the optimal schedules. Since the results ob- 
tained from these tests were similar to those presented below, we omit them 
from discussion. 

Based on the previous analysis of SPT and APRTF and the analysis of Best- 
Alpha in cni, we could have generated instances on which these algorithms did 
poorly. Having confirmed that S-PTAS finds good schedules for the algorithms’ 
bad instances, we focus here on randomly generated instances on which none of 
the algorithms are known to be particularly strong or weak. 

5.2 Tests 

All algorithms were tested on a dual-processor 500MHz Pentium III with 512
MB of main memory, running the Linux 2.2.14 kernel. We look at both quality of
the schedule and runtime of the algorithm as our units of measure. For each value
of λ, the results for the five instances are averaged together. We are primarily
interested in how the schedules generated by S-PTAS and its variants compare to 
other schedules and in how our speed-up attempts on S-PTAS affect its solution 
quality. 

We set ε = 0.4 for the tests. This value forces S-PTAS to schedule roughly
100 jobs optimally, which is close to the largest instances Chu’s algorithm can
solve quickly (that is, in one minute or less). A value less than 0.58 also helps
underscore the fact that S-PTAS is superior to Best-Alpha, in theory. In general,
we found that schedule quality improves as ε decreases.
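As a check on the figure of roughly 100 jobs, and assuming the number of jobs left for enumeration is 1/ε^5 (the exponent is a reading of the text, chosen because it is consistent with this figure), the arithmetic for the value 0.4 is:

```latex
\[
\frac{1}{\varepsilon^{5}} = \frac{1}{0.4^{5}} = \frac{1}{0.01024} \approx 97.7,
\qquad
1 + \varepsilon = 1.4 < 1.58 .
\]
```

The second inequality is the sense in which the guarantee of S-PTAS at this setting is better than the 1.58 bound of Best-Alpha.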



6 Results 

We now discuss the results of running each algorithm on the test data. First 
we look at the schedules generated by APRTF, Best-Alpha, and S-PTAS, and 
compare to the optimal algorithm. Then, we see how S-PTAS can be modified 
to run more quickly while remaining competitive. 



6.1 Performance of PTAS 

Table 1 gives the values of the schedules produced by Chu’s algorithm, APRTF,
Best-Alpha, and S-PTAS. For the non-exact algorithms, we also give the relative
error. The running times of the algorithms are given in milliseconds. The results
are broken down by the release date parameter λ, with the average over all λ at
the bottom.



Table 1. Total completion time (OPT, APRTF, Best-Alpha, S-PTAS)

      OPT        APRTF                       Best-Alpha                  S-PTAS
λ     TCT        TCT       Rel. Err.  ms     TCT       Rel. Err.  ms     TCT        Rel. Err.  ms
0.2   403911.8   405110.6  0.30%      < 1    404259.4  0.09%      18     436933.2    8.18%       26
0.4   438990.2   439129.4  0.03%      < 1    439506.4  0.12%      22     491142.8   11.88%      512
0.6   470138.4   470362.8  0.05%      < 1    470812.2  0.14%      26     543132.8   15.53%     1584
0.8   527636.2   527848.8  0.04%      < 1    528178.2  0.10%      30     613026.2   16.18%     1414
1.0   585970.4   586192.2  0.04%      < 1    586388.4  0.07%      30     679953.4   16.04%       72
1.25  737328.0   737593.8  0.04%      < 1    737557.0  0.03%      32     875744.6   18.77%       22
1.5   870683.4   870808.6  0.01%      < 1    870916.2  0.03%      20    1013163.6   16.36%       20
Avg   576379.8   576720.9  0.06%      < 1    576802.5  0.07%      25     664728.0   15.33%      521



S-PTAS generates better schedules than are guaranteed by the value of
ε used. However, APRTF (running time O(n lg n)) and Best-Alpha (running
time O(n^2 lg n)) are considerably faster and produce better schedules. Overall,
S-PTAS as implemented here doesn’t appear to be an attractive alternative to
other algorithms. Given these results, we decided to look at the necessity of rounding
job values in step 1 of the algorithm.
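The relative errors quoted for the non-exact algorithms are consistent with the usual definition (TCT − OPT) / OPT applied to the averaged values. A short Python check, with the hypothetical helper `relative_error` and the figures for the smallest release date parameter and for the overall average:

```python
def relative_error(tct, opt):
    """Relative error of a schedule's TCT with respect to the optimum."""
    return (tct - opt) / opt


# Averaged TCT values for the smallest release date parameter (0.2).
opt = 403911.8
for name, tct in [("APRTF", 405110.6), ("Best-Alpha", 404259.4), ("S-PTAS", 436933.2)]:
    print(f"{name}: {100 * relative_error(tct, opt):.2f}%")

# Overall average for S-PTAS.
print(f"{100 * relative_error(664728.0, 576379.8):.2f}%")
```

All of these errors are far below the 40% that the guarantee alone would allow at the setting used in the tests.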

Table 2 shows the effect on schedule quality of rounding only release dates,
only processing times, or neither, in step 1 of S-PTAS. Points in the graph represent
data sets. The vertical axis is running time of S-PTAS on the data; the horizontal
axis is the relative error achieved by S-PTAS on the data. S-PTAS was run on
each set of data four times, with different job values rounded for each run.
Increasing the release dates and processing times can only increase completion
times, and it turns out to be necessary only to make S-PTAS run faster, not
Table 2. Total completion time (S-PTAS with various rounding schemes)

      r_j and p_j                     r_j only
λ     TCT        Rel. Err.  ms        TCT        Rel. Err.  ms
0.2   436933.2    8.18%       26      420092.0    4.01%       3480
0.4   491142.8   11.88%      512      471639.8    7.44%     121280
0.6   543132.8   15.53%     1584      622361.2   32.38%      64418
0.8   613026.2   16.18%     1414      595805.0   12.92%      50992
1.0   679953.4   16.04%       72      673100.2   14.87%        168
1.25  875744.6   18.77%       22      874914.0   18.66%         18
1.5   1013163.6  16.36%       20      1013416.6  16.39%         20
Avg   664728.0   15.33%      521      667332.0   15.78%      34339

      p_j only                        no rounding
λ     TCT        Rel. Err.  ms        TCT        Rel. Err.  ms
0.2   442367.2    9.52%       30      417129.4    3.27%          176
0.4   496659.8   13.14%      162      456036.0    3.88%         6838
0.6   536538.2   14.12%      320      487797.6    3.76%        85388
0.8   593531.2   12.49%     1316      541006.4    2.53%       347646
1.0   645093.6   10.09%     1556      596417.6    1.78%    200236616
1.25  762066.6    3.36%     1150      738970.4    0.22%        49894
1.5   878538.6    0.90%      232      873337.2    0.30%          380
Avg   622113.0    7.93%      680      587242.0    1.88%     28675276



to guarantee the (1 + ε)-approximation. We therefore looked at the effect of not
rounding release dates and processing times.

Figure 1 shows the tradeoffs in schedule quality and running time that come
with rounding. This figure graphs the relative error and running time data found
in Table 2.

Rounding both values gives the fastest solution, with all running times under 
one second, but the poorest schedules. This happens because step 5 of S-PTAS 
has fewer unique jobs to consider, reducing the branching factor of the search 
tree. This same homogenization causes small sequences of jobs to be scheduled 
in the wrong order, which becomes apparent after the original job values are 
restored. 
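
The homogenizing effect of rounding can be illustrated with a small sketch. This section does not spell out the rounding rule used in step 1, so the code below assumes the geometric rounding that is common in approximation schemes: every value is rounded up to the next integer power of 1 + ε. The function names are hypothetical and are not taken from the implementation discussed here.

```python
import math

def round_up_geometric(x, eps):
    """Round x > 0 up to an integer power of (1 + eps).

    Assumed rounding rule, chosen for illustration only.
    """
    exponent = math.ceil(math.log(x) / math.log(1.0 + eps))
    return (1.0 + eps) ** exponent

def distinct_after_rounding(values, eps):
    """Count how many distinct values survive the rounding."""
    return len({round_up_geometric(v, eps) for v in values})

# 100 distinct processing times collapse to a handful of values,
# which is what shrinks the branching factor of an enumerative search.
sizes = list(range(1, 101))
print(len(set(sizes)), distinct_after_rounding(sizes, 0.5))
```

Under this assumed rule a rounded value is never smaller than the original and exceeds it by at most a factor of 1 + ε, which matches the observation that rounding can only increase completion times.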

Rounding only the processing times or only the release dates improves the
schedules at the expense of run time. Overall, it seems that rounding only the
processing times is sufficient for a useful algorithm. When all the processing
times are equal, the problem is easily solvable in polynomial time since the
optimal preemptive schedule would never need preemption. Having one non-zero
release date is sufficient for proving the problem NP-hard [?].

Not rounding any values gives the most accurate schedules, but increases the 
run time since the branching factor of the tree stays high. The schedules are 
accurate because slight differences in jobs aren’t masked by rounding. 
Fig. 1. Schedule quality vs. time for S-PTAS with different rounding schemes. Each point
represents a set of instances scheduled by S-PTAS with various job values rounded.



6.2 S-PTAS Variants 

The above results show that we can skip step 1 of S-PTAS and obtain better
schedules at the expense of speed. We look now at algorithms that skip rounding
and find another way to increase speed. One way is to decrease the number of
jobs scheduled optimally at the end of the schedule. We now introduce two
algorithms, S-PTASfk and SPT-h, that do this, at the cost of losing the performance
guarantee. In both algorithms, we skip step 1 of the algorithm, but continue to
increase release dates in step 2. Both algorithms skip step 3 as well; in practice
better schedules were achieved without it.

S-PTASfk. This family of algorithms schedules k jobs in step 5 of S-PTAS,
independent of the limit imposed by ε. In S-PTAS, the number of jobs remaining
by step 5 increases quickly as ε approaches zero. The relative error induced by
running a job at time t + δ instead of time t, for some δ, becomes smaller as
t increases. Thus, as the schedule grows, jobs can be scheduled less carefully
without hurting the total completion time of the schedule. Fixing the number of
leftover jobs to a value independent of ε allows the faster SPT step to schedule
more of these "late" jobs. As with S-PTAS, the performance of S-PTASfk was as
good or better on instances larger than 150 jobs, so we omit discussion of those
tests.
SPT-h. A special case of S-PTASfk is an algorithm that sets k = 0, scheduling all
jobs in step 4 and ignoring step 5 of S-PTAS altogether. This algorithm is called
SPT-h. The idea behind this algorithm is that the SPT algorithm gives good
schedules on most instances, failing only in extreme cases where very large jobs
are released just prior to large numbers of small jobs. By increasing release dates
in relation to job sizes, one hopes the errors caused by SPT never accumulate
enough to require compensating with an optimal ordering.
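
The idea can be sketched in a few lines. The code below is an illustration, not the implementation evaluated in this paper: it assumes that SPT-h delays job j to max(r_j, ε p_j), an assumption that is consistent with the later remark that SPT-h with ε = 1.0 coincides with D-SPT, and it assumes the usual non-preemptive SPT rule (whenever the machine is free, run the shortest released job). All function names are hypothetical.

```python
import heapq

def spt_total_completion_time(jobs):
    """Non-preemptive SPT on one machine; jobs is a list of (release, size)."""
    pending = sorted(jobs)          # ordered by release date
    ready = []                      # heap of (size, release) for released jobs
    t = total = i = 0
    n = len(pending)
    while i < n or ready:
        if not ready and pending[i][0] > t:
            t = pending[i][0]       # machine idles until the next release
        while i < n and pending[i][0] <= t:
            r, p = pending[i]
            heapq.heappush(ready, (p, r))
            i += 1
        p, _ = heapq.heappop(ready)  # shortest available job
        t += p
        total += t
    return total

def spt_h_total_completion_time(jobs, eps):
    """SPT after delaying each release date to max(r_j, eps * p_j) (assumed rule)."""
    delayed = [(max(r, eps * p), p) for r, p in jobs]
    return spt_total_completion_time(delayed)

# A bad instance for SPT: one large job released just before many small ones.
jobs = [(0, 100)] + [(1, 1)] * 10
print(spt_total_completion_time(jobs))          # 1155
print(spt_h_total_completion_time(jobs, 0.05))  # 176.0
```

Because the delayed release dates are never earlier than the original ones, the schedule produced for the delayed instance is also feasible for the original instance.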

For the tests involving the S-PTAS variants S-PTASfk and SPT-h, we generated
two more data sets with processing times obeying other distributions. The
first of these uses a normal distribution with mean μ = 50 and variance σ^2 = 25.
The second uses a bimodal distribution, with jobs taken with equal probability
from one of two normal distributions: one with mean μ_1 = 25 and variance
σ_1^2 = 25, the other with mean μ_2 = 75 and variance σ_2^2 = 5. For the bimodal
distribution, we were unable to compute optimal schedules for most instances,
so we substitute the SRPT lower bound for computing relative error.
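
The SRPT bound mentioned above can be computed with a short simulation. The sketch below assumes the standard preemptive shortest-remaining-processing-time rule on a single machine, whose total completion time is a lower bound for any non-preemptive schedule; the function name is hypothetical and the code is not taken from the authors' implementation.

```python
import heapq

def srpt_total_completion_time(jobs):
    """Preemptive SRPT on one machine; jobs is a list of (release, size)."""
    pending = sorted(jobs)       # ordered by release date
    remaining = []               # heap of remaining processing times
    t = total = i = 0
    n = len(pending)
    while i < n or remaining:
        if not remaining:
            t = max(t, pending[i][0])      # idle until the next release
        while i < n and pending[i][0] <= t:
            heapq.heappush(remaining, pending[i][1])
            i += 1
        rem = heapq.heappop(remaining)     # job with least remaining work
        next_release = pending[i][0] if i < n else float("inf")
        run = min(rem, next_release - t)   # run until done or next arrival
        t += run
        if run < rem:
            heapq.heappush(remaining, rem - run)   # preempted
        else:
            total += t                             # completed at time t
    return total

jobs = [(0, 100)] + [(1, 1)] * 10
print(srpt_total_completion_time(jobs))   # 175
```

On this instance the large job runs for one time unit, is preempted by the ten unit jobs, and resumes afterwards, giving a total completion time of 175.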

Since the running time of these algorithms is independent of 1/ε, we tried
many values for ε to see how schedule quality and run time change with increasing
ε. For S-PTASfk, we tried values of 50 and 75 for k. Tables 3, 4, and 5 give the
results of the S-PTAS variants for the uniform, normal, and bimodal processing
time distributions, respectively. For each of the variants, we give the TCT, the
relative error, and the running time in milliseconds.
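
As a reading aid for the tables, the relative error can be recomputed from the TCT columns. The sketch below assumes that relative error means (TCT - reference) / reference, where the reference is the optimum (or the SRPT bound when no optimum is available); this definition is an assumption, although it reproduces the reported S-PTAS entry of 8.18% for a TCT of 436933.2 against an optimum of 403911.8. The function name is hypothetical.

```python
def relative_error(tct, reference):
    """Relative excess of a schedule's total completion time over a reference."""
    return (tct - reference) / reference

# S-PTAS (both values rounded) on the first data set versus the optimum.
print(f"{100 * relative_error(436933.2, 403911.8):.2f}%")   # 8.18%
```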

SPT-h is about as fast as APRTF, Best-Alpha, and SPT. For the uniform 
and normal distributions, SPT does well, so there is no obvious need for SPT-h. 
However, SPT does a poor job with the bimodal distribution, which creates 
instances similar to the bad instance given for SPT previously. SPT-h does quite 
well here, indicating that increasing release dates is beneficial. 



Table 3. Total completion time, uniform p_j

OPT        APRTF              Best-Alpha         SPT
576379.8   576720.9  0.05%    576802.5  0.07%    576743.4  0.06%

       SPT-h                        S-PTASf50                    S-PTASf75
ε      TCT        Rel. Err.  ms     TCT        Rel. Err.  ms     TCT        Rel. Err.  ms
0.01   576743.4   0.06%      18     576700.3   0.06%      556    576667.4   0.05%      25496
0.05   576723.7   0.06%      18     576681.7   0.05%      558    576649.3   0.05%      24817
0.10   576746.7   0.06%      16     576703.8   0.06%      558    576669.4   0.05%      24071
0.20   576774.2   0.07%      16     576731.5   0.06%      558    576700.6   0.06%      24247
0.30   576742.6   0.06%      17     576698.9   0.06%      558    576666.7   0.05%      24637
0.40   576772.6   0.07%      16     576730.6   0.06%      519    576699.7   0.06%      24358
0.50   576756.9   0.07%      16     576714.0   0.06%      490    576756.9   0.07%      18
0.75   576941.2   0.10%      16     576941.2   0.10%      19     576941.2   0.10%      19
1.00   577362.5   0.17%      16     577362.5   0.17%      17     577362.5   0.17%      19
1.50   578303.6   0.33%      17     578303.6   0.33%      18     578303.6   0.33%      16
2.00   579204.6   0.49%      17     579204.6   0.49%      17     579204.6   0.49%      19
Table 4. Total completion time, normal p_j

OPT        APRTF               Best-Alpha          SPT
458065.9   458325.1  0.056%    458374.9  0.067%    458336.9  0.059%

       SPT-h                        S-PTASf50                    S-PTASf75
ε      TCT        Rel. Err.  ms     TCT        Rel. Err.  ms     TCT        Rel. Err.  ms
0.01   458336.9   0.06%      15     458323.4   0.06%      53     458312.0   0.05%      458
0.05   458357.5   0.06%      16     458344.0   0.06%      51     458332.7   0.06%      458
0.1    458323.8   0.06%      15     458310.3   0.05%      52     458304.6   0.05%      458
0.2    458328.7   0.06%      16     458315.4   0.05%      52     458302.5   0.05%      458
0.3    458473.5   0.09%      17     458460.1   0.09%      53     458451.1   0.08%      458
0.4    458717.2   0.14%      16     458704.3   0.14%      53     458693.3   0.14%      459
0.5    459034.7   0.21%      16     459019.6   0.21%      53     459034.7   0.21%      459
0.75   459717.1   0.36%      15     459717.1   0.36%      18     459717.1   0.36%      460
1.0    460557.5   0.54%      16     460557.5   0.54%      18     460557.5   0.54%      461
1.5    462006.7   0.86%      15     462006.7   0.86%      18     462006.7   0.86%      462
2.0    463426.3   1.17%      15     463426.3   1.17%      17     463426.3   1.17%      463



Table 5. Total completion time, bimodal p_j

SRPT     APRTF              Best-Alpha         SPT
460422   461274.4  0.23%    461557.3  0.25%    2306388.9  400%

       SPT-h                        S-PTASf50                    S-PTASf75
ε      TCT        Rel. Err.  ms     TCT        Rel. Err.  ms     TCT        Rel. Err.  ms
0.01   461277.8   0.19%      16     461269.3   0.18%      179    461255.0   0.18%      461
0.05   461287.7   0.19%      17     461279.2   0.19%      179    461264.5   0.18%      461
0.1    461331.7   0.20%      16     461323.2   0.20%      179    461311.6   0.19%      461
0.2    461412.0   0.22%      17     461402.4   0.21%      179    461396.3   0.21%      461
0.3    461654.1   0.27%      15     461646.7   0.27%      179    461629.9   0.26%      462
0.4    461862.5   0.31%      16     461855.1   0.31%      176    461841.5   0.31%      462
0.5    462064.7   0.36%      15     462056.2   0.35%      177    462064.7   0.36%      462
0.75   462686.3   0.49%      15     462686.3   0.49%      18     462686.3   0.49%      463
1.0    463382.2   0.64%      15     463382.2   0.64%      18     463382.2   0.64%      463
1.5    464603.4   0.91%      17     464603.4   0.91%      17     464603.4   0.91%      465
2.0    465724.1   1.15%      17     465724.1   1.15%      18     465724.1   1.15%      466



In some cases, S-PTASfk is slower than the other algorithms. However, the
running time of the faster algorithms is so small (on the order of microseconds)
that even though the running time of S-PTASfk can be orders of magnitude
greater, it is still only a few seconds at most. Further, since S-PTASfk spends
virtually its entire time solving the fixed-size scheduling problem, the running
time would increase very little as n increases. Comparing S-PTASfk to SPT-h,
we see that scheduling some jobs optimally is better than none, with schedule
quality increasing with k.
7 Related Work

In addition to the algorithms discussed previously in this paper, we should
mention two other algorithms. Both are 2-approximations that schedule jobs in SPT
order after some modification of release dates, as SPT-h does. S-SPT, developed
by Stougie and mentioned in [?], increases the release date of job j to
r'_j = r_j + p_j. D-SPT, introduced in [?], increases the release date to
max{r_j, p_j}. Note that our algorithm SPT-h can also be seen as a generalization
of D-SPT; SPT-h run with ε = 1.0 is identical to D-SPT.

8 Conclusions 

In this paper, we have examined the performance of S-PTAS along with the 
performance of several algorithms derived from S-PTAS. Our tests show that, 
when implemented strictly, S-PTAS produces good schedules, but more slowly 
than other algorithms which produce better schedules. However, by carefully
selecting ideas from S-PTAS, we find new heuristic algorithms that run quickly
while competing favorably with existing algorithms.

The idea of increasing the release dates of large jobs to some fraction of their 
processing times enables the SPT-h algorithm to perform much better than SPT 
alone. For large instances, the makespan of the schedule reaches a threshold after 
which the order of the jobs has little effect on the TCT. For S-PTAS, this means 
the limit on the number of jobs scheduled with SPT can be ignored or weakened 
in practice. By adjusting the values of ε and k, S-PTASfk becomes an adaptive
algorithm that can find near-optimal solutions for a wide range of instances.

We also learned some lessons in implementing approximation schemes in 
general. In S-PTAS, the long running time came from pessimistic restrictions on 
the number of jobs that could be scheduled before having to revert to an enu- 
merative algorithm. S-PTAS had to be designed to work for all instances, but 
many instances can be scheduled well with weaker restrictions. A good imple- 
mentation of S-PTAS needed flexibility in ignoring or relaxing some restrictions 
when possible. Allowing optional rounding and extended SPT scheduling within
S-PTAS achieved this goal. In general, this process of relaxing restrictions which 
were necessary to handle rare or extreme problem instances may be useful for 
approximation schemes in other domains. 
Abstract. Dynamic tables that support search, insert and delete operations
are fundamental and well studied in computer science. There
are many well known data structures that solve this problem, including
balanced binary trees, skip lists and tries among others. Many of the
existing data structures work efficiently when the access patterns are
uniform, but in many circumstances access patterns are biased. Various
data structures have been proposed that exploit bias in access patterns
to improve efficiency for the operations they support.

In this paper we introduce a new data structure, the biased skip list
(BSL), which is designed to work with biased access distributions.
Specifically, given key k, let its rank r(k) be the number of distinct keys
accessed since the last access to k. BSL enables one to search for k in
O(log r(k)) expected time. Insertions and deletions take O(log r_max(k))
expected time where r_max(k) denotes the maximum rank of k during its
lifespan.

Our work is motivated by recent studies on packet filtering and classifi- 
cation where keys have been found to have geometric (or more skewed) 
access probabilities as a function of how recently they have been accessed. 
We demonstrate the practicality of BSL with experiments on real and 
synthetic data with various degrees of bias. 



1 Introduction 

Maintenance of dynamic tables is fundamental to computer science with a rich 
collection of papers and textbook pages devoted to it. Given a set of keys from 
a totally ordered universe, the goal is to build a data structure which supports 
search, insert and delete operations with efficient performance bounds. Three 
popular families of data structures for implementing dynamic tables are
balanced search trees (including the classical examples of AVL-trees, red-black trees
and 2-3 trees [?]), tries [?], and skip lists [?]. These data structures provide
efficient operations and work well when the access probabilities do not vary sig- 
nificantly across the keys. However, when access patterns are biased, it is possible 

^ Part of this work will appear at INFOCOM 01.



A.L. Buchsbaum and J. Snoeyink (Eds.): ALENEX 2001, LNCS 2153, pp. 216 ff., 2001.
© Springer-Verlag Berlin Heidelberg 2001
to improve the efficiency of the operations. In this paper we describe a new data
structure, the biased skip list, which is designed to exploit bias in access patterns. 

Our new data structure is based on skip lists. Skip lists are simple,
randomized data structures with small constant factor overheads. Given a set of n keys
in sorted order, a skip list can be constructed in O(n) expected time. Search,
insert, and delete operations each take O(log n) expected time. Skip lists are
often favored over balanced search trees, whose operations have similar asymp- 
totic running times, but suffer worse performance in practice due to the overhead 
of complex balancing operations. This is especially true when the key distribu- 
tion is non-uniform over the universe. Since the performance of tries depends 
(logarithmically) on the universe size, skip lists can also be favored over tries, 
especially when the universe size is large. 

There are many circumstances under which access patterns show considerable
bias. It is well known that locality of reference results in high skew in
memory/cache [?], disk [?] and buffer [?] management for which a
least-recently-used (LRU) policy yields good results. Recent studies on table lookups
for Internet packet filtering and classification [?] demonstrate similar skew. In such
applications, the access probability of a key k decreases exponentially (or faster)
with its LRU rank, written r(k), which is the number of distinct keys accessed
since the last access to k. Such bias is observed in many other types of data
for which move-to-front based data compression [?] is very effective; examples
include source code compression [?], index compression [?] and compression of
text after Burrows-Wheeler transform [?].
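
The LRU rank defined above can be computed from an access sequence with a move-to-front list. The sketch below follows the definition literally (number of distinct keys accessed since the last access to k), so a key accessed twice in a row gets 0; the function name is hypothetical, and first-time accesses are reported as None because no previous access exists.

```python
def lru_ranks(accesses):
    """For each access, return the number of distinct keys accessed
    since the previous access to the same key (None on first access)."""
    recency = []      # most recently accessed key first
    ranks = []
    for key in accesses:
        if key in recency:
            position = recency.index(key)   # keys in front were accessed since
            ranks.append(position)
            recency.pop(position)
        else:
            ranks.append(None)
        recency.insert(0, key)              # move (or add) to front
    return ranks

print(lru_ranks("abcab"))   # [None, None, None, 2, 2]
```

A sequence with high locality of reference yields mostly small values, which is the bias a rank-sensitive structure is meant to exploit.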

Several special purpose data structures have been proposed that exploit skew
in access distribution, such as biased search trees [?], splay trees [?], treaps [?]
and variants of tries [?]. Even though skip lists possess many desirable properties
and have several advantages over the alternatives, we are not aware of any work
that modifies skip lists to take advantage of such bias.

Our new data structure, the biased skip list (BSL), improves the throughput
of skip lists in the presence of high locality of reference in access patterns. More
specifically, BSL enables one to search for key k in O(log r(k)) expected time.
This implies that if the access requests are generated by an ergodic random
source, BSL's search time matches the entropy of the source [?] and is therefore
optimal in the information theoretic sense. Note that when the bias is low (or
non-existent), BSL's performance is similar to that of a regular skip list.

Similar to the best performing data structures for dynamic table lookup
(including biased search trees, splay trees and treaps), one can insert or delete a
key k to BSL in O(log n) expected time. We show a variant of the BSL that uses
a lazy updating scheme to improve this running time to O(log r_max(k)), where
r_max(k) denotes the maximum rank that k attains during its lifetime. There are
many scenarios where improved insertion and deletion times can lead to con- 
siderably higher throughput. One such application is the maintenance of web 
proxy caches, where a small number of pages (e.g. stock news) stay very “hot”. 



^ One exception is the work of [?] which attempts to optimize skip lists when access
probabilities are static and given in advance.
i.e. keep a small rank, throughout their lifespan but are short lived. Other pages
(e.g. company web pages) may have less popularity but do live longer. See [?] for
empirical evidence and discussion. Another application is firewall maintenance,
where one needs to cater to connections with different characteristics, such as
HTTP, FTP, and telnet. High bias in insertions and deletions for firewall
maintenance and other Layer 4, IP related applications is supported by empirical
evidence [?].

Another variant of BSL is designed to enhance performance in the presence of
independently skewed access patterns, where access probabilities to keys are biased
but relatively stable over time. When accesses are i.i.d. and key k is accessed
with probability p(k), this variant of BSL facilitates accessing a key k in optimal
O(log(1/p(k))) time. Insertion and deletion times are O(log n). Further discussion
of this variant of BSL is left until the full version of this paper.



Organization of the Paper. We start with a summary of skip lists. In Section 2
we describe BSL and its operations. Following this, Section 3 shows how to
modify the basic BSL to improve insert and delete times. Section 4 demonstrates
the practicality of BSL with experiments on synthetic data and data from real
applications. Finally we end with some conclusions and discussion of further
work.



1.1 Preliminaries 

Given n keys from an ordered universe, a skip list can be constructed in O(n)
expected time if the keys are given in sorted order. Search, insert, and delete
operations each take O(log n) expected time. To construct a skip list, we start
with a linked list of all keys in the data structure in sorted order; this makes up
level log n of the skip list. We iteratively build subsequent levels log n - 1, ..., 2, 1
in a randomized fashion. Each level i is a sorted linked list consisting of a subset
of the keys in level i + 1 obtained as follows: each key in level i + 1 is copied
to level i independently with probability 1/2. Each key has links to its copies
(if they exist) on adjacent levels. Because there are log n levels, the expected
number of keys in level 1 is only 1.

To search for a key k in a skip list, we start from the smallest (leftmost) key
in the highest level. On each level, we follow the linked list to the right until we 
encounter a key which is greater than k. When that happens, we take a step to 
the left and go down one level (this leaves us at a key which is less than k). The 
search ends when k is found or when a key greater than k is found in the lowest 
level. To delete a key, we perform a search to find its highest occurrence in the 
structure and delete it from every level it appears in. To insert a new key k, we 
first search for k in the skip list. This search locates the correct place to insert k 
in each level. Key k is inserted into the bottommost level and a fair coin is tossed 
to determine how many subsequent levels k is copied to. On a successful coin 
toss, k is copied up one level and this process continues until the first failure. 
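
The operations described above can be sketched as follows. This is a minimal illustration rather than the implementation used in the experiments: it uses singly linked forward pointers and stops in front of the first key that is not smaller than k (instead of overshooting and stepping left), it numbers levels from the bottom starting at 0 (the text numbers them from the top), and it caps the height at a hypothetical MAX_HEIGHT.

```python
import random

class Node:
    def __init__(self, key, height):
        self.key = key
        self.next = [None] * height      # next[l] = successor on level l

class SkipList:
    MAX_HEIGHT = 32                      # hypothetical cap on the number of levels

    def __init__(self, seed=None):
        self.rng = random.Random(seed)
        self.head = Node(None, self.MAX_HEIGHT)   # sentinel before all keys
        self.height = 1

    def _random_height(self):
        h = 1
        while h < self.MAX_HEIGHT and self.rng.random() < 0.5:
            h += 1                       # copy up while the fair coin succeeds
        return h

    def _predecessors(self, key):
        """Per level, the last node whose key is smaller than key."""
        preds = [self.head] * self.MAX_HEIGHT
        node = self.head
        for lvl in range(self.height - 1, -1, -1):    # top level first
            while node.next[lvl] is not None and node.next[lvl].key < key:
                node = node.next[lvl]                 # move right
            preds[lvl] = node                         # then go down one level
        return preds

    def search(self, key):
        cand = self._predecessors(key)[0].next[0]
        return cand is not None and cand.key == key

    def insert(self, key):
        preds = self._predecessors(key)
        cand = preds[0].next[0]
        if cand is not None and cand.key == key:
            return False                 # already present
        h = self._random_height()
        self.height = max(self.height, h)
        node = Node(key, h)
        for lvl in range(h):             # splice into each level it reaches
            node.next[lvl] = preds[lvl].next[lvl]
            preds[lvl].next[lvl] = node
        return True

    def delete(self, key):
        preds = self._predecessors(key)
        node = preds[0].next[0]
        if node is None or node.key != key:
            return False
        for lvl in range(len(node.next)):   # unlink from every level it is on
            preds[lvl].next[lvl] = node.next[lvl]
        return True
```

A short usage example: after `sl = SkipList(seed=7)` and `sl.insert(5)`, the call `sl.search(5)` returns True and `sl.delete(5)` removes the key from every level on which it appears.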

^ Throughout the paper log n denotes ⌈log_2 n⌉.
2 Biased Skip List (BSL)

In this section we present a version of BSL which has an O(log r(k)) expected
search time for a key k, and has O(log n) expected insert and delete times. A
modification that allows more efficient inserts and deletes is described later in 
the paper. 

The biased skip list is an extension to a regular skip list in the following
way. Each key k maintained by BSL is given a unique rank in the range [1 ... n],
denoted r(k). BSL keeps all keys in a linked list in ascending rank order.
We partition this list into classes C_1, C_2, ..., C_{log n} contiguously, where the class
sizes are geometric with |C_i| = 2^{i-1}.

Construction. BSL is constructed randomly in a bottom-up fashion similar to
skip lists. It comprises several levels, each one being a doubly linked list of
keys in sorted order. The levels of the data structure are labelled L_{log n},
L'_{log n - 1}, L_{log n - 1}, ..., L_2, L'_1, L_1. The bottommost level, L_{log n},
includes the keys from all classes. To obtain the keys in any level L'_i, we copy
the following keys from level L_{i+1}: (1) all keys from classes C_i, C_{i-1}, ..., C_1
(by definition all should be present in L_{i+1}) - there will be 2^i - 1 of them,
and (2) a subset of the remaining keys in L_{i+1} picked independently, each with
probability 1/2 - the expected number of these keys will be 2^{i-1}. We obtain the
keys in level L_i slightly differently by copying the following keys from L'_i:
(1) all keys from classes C_{i-1}, C_{i-2}, ..., C_1, and (2) a subset of the
remaining keys in L'_i picked independently, each with probability 1/2. See the
accompanying figure for an example. Notice that a key k in class C_i is
automatically copied from the bottom to all levels up to L'_i and then randomly
copied to higher levels.
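
The construction can be sketched as a sequence of filtered copies. The code below is a reading of the description above under stated assumptions: classes have sizes 1, 2, 4, ..., so the class of a key with rank r is the bit length of r; levels are plain sorted Python lists rather than linked lists; and the function name and the returned format are hypothetical.

```python
import random

def build_bsl_levels(keys_by_rank, seed=0):
    """keys_by_rank[0] is the key of rank 1. Returns (levels, cls), where
    levels lists (name, sorted keys) from the bottom level to the top level
    and cls maps each key to its class index."""
    rng = random.Random(seed)
    n = len(keys_by_rank)
    m = n.bit_length()                       # number of classes
    cls = {k: (r + 1).bit_length() for r, k in enumerate(keys_by_rank)}

    def copy_up(level, auto_max_class):
        # Keys of class <= auto_max_class are copied automatically,
        # every other key is copied with probability 1/2.
        return [k for k in level
                if cls[k] <= auto_max_class or rng.random() < 0.5]

    levels = [("L%d" % m, sorted(keys_by_rank))]      # bottom: all keys
    for i in range(m - 1, 0, -1):
        primed = copy_up(levels[-1][1], i)            # L'_i from L_{i+1}
        levels.append(("L'%d" % i, primed))
        plain = copy_up(primed, i - 1)                # L_i from L'_i
        levels.append(("L%d" % i, plain))
    return levels, cls

keys = list(range(100, 115))                 # 15 keys, so 4 classes
random.Random(1).shuffle(keys)               # position in the list = rank
levels, cls = build_bsl_levels(keys, seed=3)
for name, level in reversed(levels):         # print from the top level down
    print(name, level)
```

Since every level is a filtered copy of the level below it, the levels are nested and each stays in sorted key order.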



[Figure: example of a BSL; the vertical axis is labelled "Level".]

^ In fact, with appropriate modifications, class sizes may follow any geometric
sequence.
Lemma 1. A BSL with n keys can be constructed in O(n) expected time,
provided the keys are given in sorted order.

Proof. Let us look at how many times an element will be copied up. Since the
copying probability is 1/2, the expected number of levels on which an element
will be copied as a result of a coin toss is 1. Let us count the automatically made
copies. C_{log n}, which is of size (n + 1)/2, is not subject to automatic copying.
C_{log n - 1}, which is of size (n + 1)/4, is copied automatically on to two levels. In
general, C_{log n - p} gets copied on to 2p levels. Summing up over all elements, the
total number of automatically made copies is O(n); combined with the randomly
made copies, the total construction time is O(n). □
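
The summation in the proof can be written out explicitly. The derivation below is a worked version of that step, assuming (as the stated sizes (n + 1)/2 and (n + 1)/4 suggest) that class C_{log n - p} has (n + 1)/2^{p+1} keys.

```latex
\sum_{p=0}^{\log n - 1} \underbrace{\frac{n+1}{2^{p+1}}}_{|C_{\log n - p}|} \cdot 2p
  \;=\; (n+1) \sum_{p=0}^{\log n - 1} \frac{p}{2^{p}}
  \;\le\; (n+1) \sum_{p=0}^{\infty} \frac{p}{2^{p}}
  \;=\; 2(n+1) \;=\; O(n).
```

The last equality uses the standard identity \sum_{p \ge 0} p\,x^{p} = x/(1-x)^{2} at x = 1/2.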

Search. Searching for a key k in a BSL is similar to that in a regular skip list.
We start from the smallest key in the topmost level. On each level, we follow the
linked list to the right until we encounter a key which is greater than k. Then
we take a step back (to the left) and go down one level. We end the search either
when we find k, or when we reach a key greater than k in the bottommost level,
which indicates that k is not present in the data structure.

If k is found then some adjustments to the data structure are necessary to
reflect the changes in rank. The rank of k becomes 1 and thus it is moved to the
front of the rank ordered linked list. If necessary, extra copies of k are inserted
in levels up to L'_1. If k was in class C_c before the search, then the last (by
rank ordering) key in each class C_i, for i < c, will move to class C_{i+1}. For each
such key, two of the automatically made copies must be removed so that its
automatically made copies stop at level L'_{i+1}.
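
The class changes triggered by a successful search can be traced on a plain rank-ordered list. The sketch below only tracks ranks and classes, not the levels themselves; it assumes class sizes 1, 2, 4, ..., so that the class of rank r is the bit length of r, and the function names are hypothetical.

```python
def class_of(rank):
    """Class index of a rank when class sizes are 1, 2, 4, ... (assumed)."""
    return rank.bit_length()

def access(order, key):
    """order lists keys by ascending rank (index 0 has rank 1).
    Moves key to the front and returns (new_order, moves), where moves maps
    every key whose class changed to its (old class, new class)."""
    old = {k: class_of(i + 1) for i, k in enumerate(order)}
    new_order = [key] + [k for k in order if k != key]
    new = {k: class_of(i + 1) for i, k in enumerate(new_order)}
    moves = {k: (old[k], new[k]) for k in order if old[k] != new[k]}
    return new_order, moves

order = list("abcdefg")            # a has rank 1, g has rank 7
new_order, moves = access(order, "e")
print(new_order)                   # ['e', 'a', 'b', 'c', 'd', 'f', 'g']
print(moves)                       # {'a': (1, 2), 'c': (2, 3), 'e': (3, 1)}
```

In this example the accessed key e was in class 3, and exactly one key from each lower class (a from class 1, c from class 2) moves to the next class, which is why the adjustment touches only O(c) keys.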

Lemma 2. A successful search for an element k in a BSL of n keys takes
O(log r(k)) expected time, whereas an unsuccessful search takes O(log n) expected
time.

Proof. First, consider a successful search. Due to the partitioning into classes,
key k belongs to class C_c, where c ≤ log r(k) + 1. To bound the search time,
we bound the horizontal and vertical distance that we travel until we encounter
k. The construction method for the BSL dictates that all the keys in class C_c
are present on level L'_c, which is at depth 2c from the top of the BSL. Once
we reach level L'_c, we are guaranteed to find k, thus we travel at most 2c - 1
vertical links from the top of the BSL. We now analyze how many (expected)
horizontal links are traversed per level. On the top level, there is an expected
single key, corresponding to at most 2 links. Let us now consider our actions on
level l. We must have come to level l from the level above, say l', from some key
k'. Let k'' be the key immediately to the right of k' on level l'. By our search
strategy, k' < k < k''. Thus, on level l, in the worst case, we need to traverse
the links between k' and k'', which number one more than the number of keys
on level l between k' and k'' not copied up to level l'. In a situation where all
keys are always subject to the randomized process to decide whether they will
be copied up, the expected number of keys that are skipped between two keys
that are copied is 1. In our scheme, since some keys get copied automatically,
the expected number is even less. Thus, the expected number of links on level l,
between k' and k'', is less than 2. Since we visit at most 2c - 1 levels, traversing at
most an expected 2 links on each level, the total (expected) running time to find
k in the data structure is O(c); substituting the value for c, this is O(log r(k)).
The rank adjustments to the data structure can also be done in O(log r(k)) time.
To see this, observe that at most 2c - 1 extra copies of k must be made and
c - 1 keys (one from each class C_i, i < c) must be removed from 2 levels each.
Therefore the total running time for a successful search is expected O(log r(k)).

For an unsuccessful search, the number of horizontal links traversed per level
stays the same; however, one has to go all the way down to the bottom of
the data structure. Therefore, the vertical distance traveled is O(log n) levels,
with an expected constant number of links per level, giving an overall expected
running time of O(log n). □
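
The claim in the proof that an expected single key is skipped between two copied keys follows from a geometric distribution. As a worked check, assuming each key is copied up independently with probability 1/2, the number J of consecutive uncopied keys before the next copied key satisfies:

```latex
\Pr[J = j] = \left(\tfrac{1}{2}\right)^{j} \cdot \tfrac{1}{2},
\qquad
\mathbb{E}[J] \;=\; \sum_{j=0}^{\infty} j \left(\tfrac{1}{2}\right)^{j+1}
  \;=\; \tfrac{1}{2} \sum_{j=0}^{\infty} \frac{j}{2^{j}}
  \;=\; \tfrac{1}{2} \cdot 2 \;=\; 1 .
```

The number of links traversed is one more than the number of skipped keys, which gives the bound of 2 expected links per level used above.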

Insertion and deletion. To insert a new key k, we first search for k knowing it will
not be found. This locates the correct place to insert k in each level. Copies of k
are made in each level up to L'_1 and k is added to the front of the rank ordered
list. One key from each class is removed from 2 levels in the data structure to
update the new rank ordering, in the same fashion as for a search. This can be
done in O(log n) expected time.

To delete a key k from BSL, we follow the search procedure, removing any
copies of k we find. In this case the first (by rank ordering) key in each class that
is numbered higher than the class of k moves to the next lower numbered class.
Such keys are copied up two further levels in the data structure to reflect the
new rank ordering. Again, this can be accomplished in O(log n) expected time.

3 Improving the Efficiency of BSL 

BSL, as described above, performs well on searches; however, keys with small
rank have many copies, which makes insertion and deletion less efficient. For
instance, when we insert a key with rank 1, we make O(log n) copies of the key
- one per level. If we know that the key will always stay close to the top of the
structure, it is wasteful to copy it all the way to the bottom. To improve this,
BSL can be modified so that elements in class C_i are not automatically present
in all the levels below L'_i, but are randomly copied down, the same way they are
randomly copied up. Notice that not copying down keys from L'_i at all can cause
gaps, especially when keys in the same class are grouped together in terms
of key values; this would affect efficiency. This modification improves the time,
as well as the space efficiency of the data structure.

Construction. We start with the bottommost level L_{log n} and iteratively build
up. Level L_{log n} contains all keys that belong to class C_{log n}. For i < log n, an
upper level L'_i is constructed by copying those keys of level L_{i+1} that are chosen
independently with probability 1/2. L'_i also includes all keys in class C_i; these
are called the default keys of L'_i. To facilitate efficient search some of the default
keys of L'_i are copied to the lower level L_{i+1}. This is done independently with
probability 1/2. Each key copied to level L_{i+1} may further be copied to lower
levels with independent coin tosses. Once we have the keys
in level L'_i, we construct level L_i by copying those keys in level L'_i which are
chosen independently with probability 1/2. For an example, see the accompanying figure.

Lemma 3. The improved BSL with n keys ean be eonstrueted in 0(n) expeeted 
time, provided the keys are given in sorted order. 

Proof. The probabilistic copying down of keys does not add significantly to the
number of keys per level, since the expected number of keys copied down
decreases faster than the expected number of keys on each level as we go down.
Thus, we still maintain that any level $L_i$ or $L'_i$ will contain $O(2^i)$
keys. We examine the cost per level. Level $L'_i$ is formed from the previous
level and the class $C_i$ in time $O(|C_i|)$. Level $L_i$ is formed from $L'_i$,
again in $O(|C_i|)$ time. The only remaining cost is the copying down of keys
originating from $L'_i$. Note that the expected number of times a key will be
copied down is just under 1. Thus, the expected total number of copies from
$C_i$ that will lie below $L'_i$ is $O(|C_i|)$. To copy a key down, we go to its
left neighbor on the same level and attempt to go down, or keep going left until
we can go down one level. After that, we go right until we come to a key greater
than the one we need to copy, and insert our key before it. The expected number
of steps that we take to the left is $O(1)$, since the probability that a key on
this level will not exist on the previous level is less than 1/2. With a similar
argument, the expected number of steps that we take to the right is $O(1)$.
Thus, we spend $O(1)$ time copying one key down, and $O(|C_i|)$ time taking care
of the copying of the entire $C_i$. All the costs associated with levels $L'_i$
and $L_i$ hence add up to $O(|C_i|)$. The sum of these costs over all $i$ gives
us $O(n)$. □
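
The bottom-up construction can be illustrated with a short Python sketch. It is a simplified model, not the linked, linear-time construction of Lemma 3: levels are held as plain sets and sorted at the end. The function name `build_bsl_levels`, the list-of-classes input, and the rule that a copied-down key keeps descending one level per successful coin toss are assumptions made for this example.

```python
import random

def build_bsl_levels(classes, rng):
    """Build levels bottom-up from classes C_1..C_m (classes[0] is C_1).

    Returns the levels topmost first: L_1, L'_1, L_2, L'_2, ..., L_m.
    Hypothetical helper for illustration only.
    """
    m = len(classes)
    levels = [set(classes[-1])]              # bottommost level L_m holds C_m
    for i in range(m - 2, -1, -1):
        below = levels[0]                    # level L_{i+1}, currently on top
        # L'_i: each key of L_{i+1} is copied up with probability 1/2 ...
        primed = {k for k in sorted(below) if rng.random() < 0.5}
        # ... and all keys of class C_i are added as default keys.
        primed |= set(classes[i])
        # Default keys are copied down, one level per successful coin toss
        # (assumed reading of "further be copied to lower levels").
        for k in sorted(classes[i]):
            depth = 0
            while depth < len(levels) and rng.random() < 0.5:
                levels[depth].add(k)
                depth += 1
        # L_i: each key of L'_i is copied up with probability 1/2.
        plain = {k for k in sorted(primed) if rng.random() < 0.5}
        levels.insert(0, primed)
        levels.insert(0, plain)
    return [sorted(level) for level in levels]

classes = [[40], [10, 70], [20, 30, 50, 80], [5, 15, 25, 35, 45, 55, 65, 75]]
levels = build_bsl_levels(classes, random.Random(1))
```

With $m$ classes the sketch produces $2m - 1$ levels, and every class $C_i$ is fully present on its default level $L'_i$; which keys appear elsewhere depends on the coin tosses.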

Search. To search for a key $k$, we start from the smallest key in the topmost
level. On each level, we follow the linked list to the right until we encounter
a key $j$ which is greater than or equal to $k$. If $j = k$, we successfully
terminate the search. Otherwise we go left until we reach a key $l$ that has a
copy one level below. Then we simply go one level down and iterate starting from
the copy of $l$ at that level. If the level is the bottommost one and while
going right we have encountered a key which is larger than $k$, then we conclude
that $k$ is not present in the BSL, and terminate. Any rank adjustments that
must be made after a key is found are done similarly to those described for
searches earlier in the paper.
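
The search procedure can be sketched in Python as follows. This is a minimal illustration under stated assumptions: every level starts with a sentinel node holding minus infinity (standing in for "the smallest key in the topmost level"), and the names `Node`, `link_levels`, and `search` are hypothetical. Rank adjustments after a successful search are left out.

```python
import math

class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None
        self.down = None   # copy of this key one level below, if any

def link_levels(levels):
    """Link sorted key lists (topmost first) into doubly linked levels."""
    below = {}
    head = None
    for keys in reversed(levels):
        current = {}
        prev = None
        for key in [-math.inf, *keys]:       # sentinel first (an assumption)
            node = Node(key)
            node.left = prev
            if prev is not None:
                prev.right = node
            node.down = below.get(key)       # None if no copy below
            current[key] = node
            prev = node
        head = current[-math.inf]
        below = current
    return head, len(levels)

def search(head, height, k):
    node = head
    for depth in range(height):
        # Follow the list to the right until a key >= k (or the end).
        while node.key < k and node.right is not None:
            node = node.right
        if node.key == k:
            return True
        if depth == height - 1:
            return False                     # bottommost level: k is absent
        if node.key > k:
            node = node.left                 # step back past the overshoot
        # Go left until we reach a key that has a copy one level below.
        while node.down is None:
            node = node.left
        node = node.down
    return False

head, height = link_levels([[5], [2, 5, 9], [2, 7, 9, 12],
                            [1, 2, 4, 7, 9, 12, 15]])
```

In the example levels, key 5 exists only on the two upper levels, so a search that passes through it on the second level must walk left to key 2 before it can descend.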

3.1 Lazy Updating 

To facilitate efficient implementation of insertions and deletions, we employ a
lazy updating scheme of levels and allow flexibility in class sizes. The size of
class $C_i$ is allowed to vary within a bounded range, while its default size
remains $2^{i-1}$. The lazy updating allows us to keep most insertions and
deletions local to the upper levels. As a result, the insertion or deletion of a
key $k$ takes $O(\log r_{\max}(k))$ time, where $r_{\max}(k)$ is the maximum
rank of key $k$ in its lifespan.

Search. In accordance with how LRU ranks are assigned, when a key $k$ is
accessed its rank becomes 1 and the ranks of all keys whose ranks were smaller
are incremented by one. The rank changes are reflected in the BSL without
changing the number of keys in any class or level; class sizes may only change
after an insertion or a deletion. This enables BSL to search for an existing key
$k$ in $O(\log r(k))$ expected time. Searching for a key which is not in the
data structure takes $O(\log n)$ expected time, as before. The full description
and proof are given in the cited reference.
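
As a small worked example of the LRU rank rule, the following Python sketch applies it to an explicit rank table. The dictionary representation and the function name `access` are assumptions for illustration; the BSL itself does not store ranks this way.

```python
def access(ranks, key):
    """Apply the LRU rule: `key` gets rank 1, smaller ranks shift by one."""
    old = ranks[key]
    for other, r in ranks.items():
        if r < old:
            ranks[other] = r + 1   # keys that were ahead of `key` move back
    ranks[key] = 1

ranks = {"a": 1, "b": 2, "c": 3, "d": 4}
access(ranks, "c")
# ranks is now {"a": 2, "b": 3, "c": 1, "d": 4}
```

Keys with a larger rank than the accessed key (here `d`) are unaffected, which is why an access to a low-ranked key only disturbs the top of the structure.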

Insertion and deletion. When a key $k$ is inserted into the BSL it is given rank
1 and the ranks of all keys in the data structure are incremented by one. After
an insertion, if the size of a class $C_i$ reaches its upper limit, i.e. $2^i$,
then the highest ranked half of the keys in $C_i$ change their default level
from $L'_i$ to $L'_{i+1}$. This is done by moving the topmost and bottommost
levels of each such key down by two. One can observe that such an operation can
be very costly; if, for example, all classes $C_1, C_2, \ldots, C_i$ are full,
an insertion will change the default levels of 1 key in class $C_1$, 2 keys in
class $C_2$, and in general $2^{i-1}$ keys from class $C_i$. However, it is
possible to amortize more costly insertions against less costly ones: we show
that the average insertion time for a key $k$ is $O(\log r_{\max}(k))$, where
$r_{\max}(k)$ denotes the maximum rank of $k$ in its lifespan.
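
The cascading effect of an insertion can be sketched at the level of class membership alone. The Python below is an illustrative model, not the authors' implementation: it ignores the levels entirely, keeps each class as a deque ordered from most to least recently used, and reads "highest ranked half" as the keys with the largest rank numbers. The name `insert_key` is hypothetical.

```python
from collections import deque

def insert_key(classes, key):
    """Insert `key` with rank 1; return how many keys changed class.

    classes[j] models class C_{j+1}, most recently used key first.
    """
    if not classes:
        classes.append(deque())
    classes[0].appendleft(key)
    moved = 0
    j = 0
    # Class C_{j+1} overflows when its size reaches the upper limit 2^(j+1).
    while len(classes[j]) >= 2 ** (j + 1):
        if j + 1 == len(classes):
            classes.append(deque())
        half = len(classes[j]) // 2
        for _ in range(half):
            # Pop the least recently used key and push it onto the front of
            # the next class; popping in this order preserves rank order.
            classes[j + 1].appendleft(classes[j].pop())
        moved += half
        j += 1
    return moved

classes = []
moves = [insert_key(classes, k) for k in range(1, 6)]
# classes: [5], [4, 3], [2, 1]; moves: [0, 1, 1, 1, 3]
```

The fifth insertion moves one key out of $C_1$ and two keys out of $C_2$, matching the pattern of 1, 2, ..., $2^{i-1}$ keys per full class described above, while most insertions touch only the first class.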

To delete a key $k$, we first search for $k$ in the BSL, which changes its rank
to 1, and then simply delete it from all the levels on which it has copies. If the number of keys in class
Ci after the deletion is above its lower limit of 2*“^ then we stop. Otherwise, 
we go to the next class Ci+i and move its lowest ranked 2*“^ keys to class Ci 
by moving their default levels up by two. Similar to insertion, it is possible
to amortize more costly deletions against less costly ones; the main result in
this section is that the average deletion time for a key $k$ is
$O(\log r_{\max}(k))$.



Footnote (to the search procedure): Remember that a key at level $L_i$ does not
necessarily have a copy in level $L_{i+1}$.

The BSL implementing MTF enables insertion or deletion of a key $k$ in
$O(\log r_{\max}(k))$ amortized expected time, where $r_{\max}(k)$ is the
maximum rank of key $k$ during its lifespan. The full description and proof are
given in the cited reference.

4 Experimental Results 

In order to evaluate the performance of BSL in practice, we provide a compar- 
ative, experimental study of BSL implementations and other data structures 
using data from real applications and synthetic data with varying degrees of 
bias. The architecture used in these experiments has an AMD K6-2 300MHz 
processor with 64M RAM, 32K level 1 cache (on the CPU) and 512K level 2 
cache. 

Our experiments on synthetic data compare search times when the keys are
accessed with a geometric distribution on LRU ranks. At each step of the
simulation, a key $k$ with rank $r(k)$ was accessed with a probability $P(k)$
that is geometric in $r(k)$, where $p$ is the bias parameter. By varying $p$ we
changed the average rank of the
keys that are accessed and hence simulated the “hot” working set phenomenon 
observed in packet traces |t)l 1 2j . It was noticed in many contexts 1 1 1 1 12j that the 
probability of accessing a key drops exponentially with the time the key is kept 
inactive; thus a geometric distribution provides appropriate means for modeling 
the bias in bursty access patterns. 
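
A workload of this kind can be simulated with a few lines of Python. The exact formula for $P(k)$ is not legible in this copy, so the sketch assumes the standard geometric law $P(k) = p(1-p)^{r(k)-1}$, truncated at the number of keys, under which the average accessed rank is close to $1/p$. The function names are hypothetical.

```python
import math
import random

def sample_rank(p, n, rng):
    """Draw a rank in 1..n from a geometric law with parameter p (assumed)."""
    u = 1.0 - rng.random()                     # uniform on (0, 1]
    r = int(math.log(u) / math.log(1.0 - p)) + 1
    return min(r, n)                           # truncate at the table size

def simulate_accesses(keys, p, steps, rng):
    """Access keys by geometric LRU rank; return the average rank accessed."""
    lru = list(keys)                           # index 0 holds rank 1
    total = 0
    for _ in range(steps):
        r = sample_rank(p, len(lru), rng)
        total += r
        lru.insert(0, lru.pop(r - 1))          # accessed key gets rank 1
    return total / steps

average_rank = simulate_accesses(range(1000), 0.1, 20000, random.Random(7))
```

With `p = 0.1` the average rank comes out near 10; lowering `p` spreads the accesses over a larger working set.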

Our experiments on real data use the publicly available LBL trace data [Z|. 
This trace contains full address lookups to 1.8 million TCP packets flowing 
between the Lawrence Berkeley Labs firewall and the rest of the Internet. Notice 
that in firewall applications one usually needs to search for full IP addresses (and 
port numbers) rather than address prefixes. 

There are a number of parameters which affect the practical performance of a
BSL implementation. For example, the size of the topmost class $C_1$ can be
tuned to capture the bias in the data, “optimizing” BSL’s performance. Figure 3
shows how the size of class $C_1$ affects the search times for accesses
following geometric distributions with different $p$ values. An interesting
observation is that as the average rank of the searched keys increases, the
optimal size for class $C_1$ increases. For example, when the average rank is 10
the optimal size of $C_1$ is 16; when the average rank is increased to
$10^2 = 100$ the optimal size of $C_1$ increases to $16^2 = 256$.

Fig. 3. Search times for different class 1 sizes - 10® searches on 98K keys

The summary of our experiments on simulated data with geometric access
distribution on LRU ranks is provided in Figure 4. Here we compare the
performance of several variants of BSL with that of a regular skip list and a
simple implementation of move-to-front (MTF) via a linked list. These data
structures were chosen for the following reasons. On unbiased or slightly biased
data, the regular skip list is often considered to be the best performing search
data structure. Thus, it provides an excellent means to measure the performance
of BSL on data with low bias. An MTF list, on the other hand, is a very simple
data structure which provides the best performance on highly biased data.
Although it has a worst case search performance of $O(n)$ (which is unacceptable
for many applications, especially if searches that fail to find a key occur
often), it provides a very good benchmark for how well BSL performs on highly
biased distributions.




Fig. 4. Search times on geometrically distributed data with varying biases; a
total of 10® searches were performed on 198K keys. (Horizontal axis: average
rank, from 0 to 500.)



The first observation we make is that the BSL implementations consistently 
outperform the regular skip lists on all degrees of bias we tested, although we can 
see the skip list search times begin to plateau before BSL. This shows that the 
added complexity in maintaining multiple classes in BSL is more than compen- 
sated by the gains in efficiency with biased accesses. Although it is conceivable 
that skip lists may perform slightly better than BSL on even lower biases than 



Footnote: The skip list implementation we used was written by William Pugh and
is available at ftp://ftp.cs.umd.edu/pub/skipLists/
Fig. 5. Comparison of BSL ($|C_1| = 8$) and MTF list performances with disabled
caches on geometrically distributed data; a total of 10® searches were performed
on 98K keys.



we tested, we can safely predict that for all biases of practical interest in the 
packet filtering and classification applications we consider, BSL is a favorable 
choice. 

Another observation is that BSL implementations outperform the MTF lists by at
least one order of magnitude when the bias is low. However, MTF lists perform
better with high bias. Although this phenomenon is partially due to the
simplicity of MTF lists, we suspected that an even bigger factor is cache
effects: while running BSL or the regular skip list implementation, the caches
are expected to hold some of the intermediate nodes that are encountered during
the searches, losing valuable space; this is not the case for MTF lists. To
check our hypothesis, and also to see how well BSL would compare to MTF lists in
next generation architectures, where it would be possible to place the whole
data structure in on-chip memory, we tested both BSL and MTF list
implementations on the same data after disabling caches. Results of this test
are summarized in Figure 5, where one can observe that BSL catches up with the
performance of MTF lists much earlier. Yet, the initial gains of MTF lists
suggest that they can be combined with BSL in a hybrid scheme where a short MTF
list acts as an initial filter before a search is forwarded to BSL; we
implemented and used such a hybrid data structure in our tests.

Three different BSL data structures were included in our experiments. One uses
8 for the size of class $C_1$ and another uses a size of 512. We see that
setting $|C_1| = 8$ works well for highly biased data, but the overhead of
maintaining small classes gives worse performance for accesses with smaller
biases. Using $|C_1| = 512$ gives better performance over a larger range and
shows improvement over regular skip lists beyond a working set size of 500. The
performance of a hybrid scheme, where keys are initially searched in a
move-to-front linked list, can also be found in the plots. We used a list of 50
keys before resorting to a regular BSL with 512 keys in $C_1$.
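
The hybrid idea can be sketched as a short MTF list placed in front of a backing table. The Python below is only an illustration: a dictionary stands in for the BSL, and the policy for moving keys into and out of the front list (promote on every hit, drop the last entry when the list is full) is an assumption, since the text does not spell it out. The class name `HybridTable` is hypothetical.

```python
class HybridTable:
    """Short move-to-front list used as a filter before a backing table."""

    def __init__(self, front_capacity=50):
        self.front_capacity = front_capacity
        self.front = []      # (key, value) pairs, most recently used first
        self.backing = {}    # stand-in for the BSL; holds every key

    def insert(self, key, value):
        self.backing[key] = value

    def search(self, key):
        # First scan the short MTF list.
        for i, (k, v) in enumerate(self.front):
            if k == key:
                self.front.insert(0, self.front.pop(i))   # move to front
                return v
        # Miss in the filter: forward the search to the backing table.
        if key not in self.backing:
            return None
        value = self.backing[key]
        self.front.insert(0, (key, value))
        if len(self.front) > self.front_capacity:
            self.front.pop()   # dropped key is still in the backing table
        return value

table = HybridTable(front_capacity=2)
for i, name in enumerate(["a", "b", "c"]):
    table.insert(name, i)
```

Because the front list has bounded length, a failed search costs at most one short scan plus one lookup in the backing structure, which avoids the $O(n)$ worst case of a plain MTF list.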
To make sure that we are competitive with other dynamic table lookup methods,
we compare the performance of BSL with simple binary tries. Figure 6
demonstrates that BSL outperforms tries on 64-bit keys throughout the whole
range of biases we considered (for $n$ = 98K); it also performs favorably on 32-
and 48-bit keys for biases observed in practice. We are going to include
performance comparisons of BSL with balanced binary trees in the full version of
the paper.




Fig. 6. Comparison of a hybrid BSL (with an MTF list of size 50) and binary trie
performances on synthetic data with varying universe sizes. A total of 10®
searches were performed on 98K keys. (Horizontal axis: average rank.)



Our final experiments compare the data structures mentioned above using the LBL
trace data. For ease of implementation, the flows are initially identified and
mapped to unique integer keys in a preprocessing phase. Since the LBL trace data
already uses modified IP addresses for privacy reasons, and the data structures
in question should not be affected by the values of the keys, this should not
change the results of the test. The trace contains approximately 1.8 million
packets and includes around 15 thousand insert operations. The average rank of
the searched keys is approximately 25. The timing results of the data structures
considered in our study on the LBL data are summarized in Table 1. Given the
high bias of the data and the results shown by previous experiments, we would
have expected BSL to perform better when compared with skip lists. We explain
the observed results as follows. The total number of keys in the trace data
builds gradually to the relatively low total of around 15 thousand. This is
favorable to the $O(\log n)$ search and insert times for skip lists.
Additionally, BSL has a higher overhead during insertion. We hope the modified
BSL proposed in Section 3 will improve this. The results also show that MTF
lists process the data quickest. However, if the data produced more failed
searches, the time would increase dramatically. For Internet applications in
particular, this is not acceptable since it could be exploited in denial of
service attacks. The hybrid BSL performs competitively, suggesting it may prove
useful for applications that deal with biased data.
Table 1. Running times for the LBL trace data.

Data Structure                            Time (secs)
Skip List                                 4.4
MTF List                                  1.97
BSL, |C_1| = 8                            7.23
BSL, |C_1| = 512                          5.87
BSL + MTF (length 50), |C_1| = 512        3.3
BSL + MTF (length 256), |C_1| = 4096      2.61


5 Conclusions 

We introduced a new alternative to the data structures for maintaining dynamic
tables, which is designed for biased access patterns. Its adaptive nature
captures changes to the access distribution and keeps the search times
efficient, even when the frequently accessed keys change. The improvements
described in Section 3 offer fast insert and delete operations for applications
where the lifetime of keys is short.

Preliminary experimental results suggest BSL may be a suitable choice for
highly biased, dynamic data. The results show that BSL offers fast access to
frequently accessed keys, while providing efficient access times for
infrequently accessed ones. There are many parameters that affect the practical
performance, such as the size of class $C_1$ and the length of the initial MTF
list in the hybrid BSL. A more sophisticated implementation could dynamically
choose the most suitable values based on the recent access patterns. This would
further increase the adaptive nature of BSL and expand the range over which BSL
is competitive. We leave such a study for future work. More experiments that
compare BSL with the alternatives are required in order to fully evaluate its
practicality. In addition, more tests using data from real applications are
needed to help quantify any improvements.